
Jul. 5th, 2013

Division on a Microcontroller

I was using an AVR microcontroller to learn assembly language and encountered a neat trick for doing a division by a constant.

The processor I'm using is an ATmega328P, which is an 8-bit processor. While it does not have a divide instruction, it has a multiply instruction that is quite fast (2 clock cycles).

In this example I will be writing a short assembly routine which will divide an 8-bit number by 10.

Things to keep in mind

A useful identity which will help us here is the fact that a / b = a * (1 / b). What this means is that a division is equivalent to a multiplication by the reciprocal of the divisor (e.g. 5 / 10 = 5 * 0.1).

Another issue with the processor is that it can only handle integers; there is no support for floating point operations. However, this can be remedied by using fixed point arithmetic. In fixed point arithmetic you place an implicit decimal point between two particular bits. As an example of this, imagine having to work with currency values. Instead of doing your calculations in euros, you calculate everything in cents, so you only ever work with whole numbers. The important thing is that, whatever the current value, you always know that if you just write a decimal point between the 2nd and 3rd digits from the right, you will have that value in euros again. Example: €1.50 + €2.30 becomes 150c + 230c, which gives 380c. If I imagine a decimal point between the 2nd and 3rd digits, I get the value in euros instead of cents (€3.80).

Addition and subtraction are very easy to do in fixed point arithmetic. You just add or subtract the values together, and everything else will fall into place. Multiplication is a bit trickier, but not by much. Let's say I have two numbers X = 2.45 and Y = 3.5 and I multiply them together. Their answer, which I'll call Z, is 8.575. An observation to make here is where the decimal point ends up. It's actually pretty simple: X has 2 digits after the decimal point, Y has 1 digit after the decimal point, and Z has 3 digits after the decimal point. I told you it's not really that complicated! So when we do a multiplication we just have to keep in mind that the result will have as many digits in the fractional part as the original numbers had combined.
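To illustrate, here is a small Java sketch using the numbers from this paragraph (the class name is mine): the multiplication is carried out entirely on integers, and the decimal point is re-inserted afterwards.

```java
public class FixedPointDemo {
    public static void main(String[] args) {
        int x = 245;   // represents 2.45 (2 fractional digits)
        int y = 35;    // represents 3.5  (1 fractional digit)

        int z = x * y; // 8575, representing 8.575 (2 + 1 = 3 fractional digits)

        // Re-inserting the decimal point 3 digits from the right:
        System.out.println(z / 1000.0); // prints 8.575
    }
}
```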

Approximating 0.1

The first thing to do is to convert 0.1 to binary. I won't go into detail about how this is done here, but there's a pretty good explanation over here. We run into yet another problem pretty soon: 0.1 cannot be accurately represented in binary! It converts to 0.00011001100110011001100...

But we don't need to represent 0.1 entirely accurately (which is impossible). As stated earlier, the processor handles 8-bit values, therefore it can only handle numbers from 0 to 255. So we only need our representation of 0.1 to be accurate enough to give good results for this range of numbers. In effect we won't be calculating x / 10; we will only be approximating it, but approximating it in a way that happens to give the correct result for all values from 0 to 255.

Now the way integer division works is that it will truncate the fractional part. There is no rounding. In integer arithmetic 99 / 100 = 0, because the fractional part (0.99 in this case) is discarded. What this means is that my approximation of 0.1 must be larger than or equal to 0.1. If I choose a value which is slightly smaller, I will get results which are smaller than they should be. Consider 10 / 10: if we calculate this using an approximation that is slightly smaller than 0.1, then the result will definitely be smaller than 1 (since we're adding less than 0.1 ten times). After discarding the fractional part we're left with the result 0, which is wrong. It should be 1.

With this revelation I can deduce another thing. Since 0.1 cannot be represented accurately in binary, each additional bit I calculate will get me slightly closer to 0.1, but never actually reach it. Therefore I will definitely have to round my approximation of this number up at some point. It follows that I only need to check positions where there is a 0 bit immediately followed by a 1 bit: if I round up at these positions, I will get a value slightly larger than 0.1, since 01 rounds up to 10. (Example: the first four bits of the representation are 0.0001, which has a value of 0.0625, less than my target value. If I round that up I get 0.001, which has a value of 0.125, slightly larger.) If I round up wherever I encounter 11, the result will be equivalent to one I encountered earlier (there will always be an 01 before a 11), while 00 and 10 have nothing to round up, since their second bit is 0.

So we're making progress now. We know that the calculated value for 0.1 must be slightly larger than 0.1. But how will we know if it's precise enough? Another important question is how will I know that my value is not too precise (which would mean I'm using more bits than I need)? To answer this I will measure the Error of my approximation. This is actually really simple: I just take the difference between my approximation and the target value. Take the example in the above paragraph, where we arrived at a possible approximation of 0.125. The Error in this case is 0.125 - 0.1 = 0.025. Now how will I know if this value will work for all the possible values (0 to 255)?

In order to reason this out it helps to think of a multiplication as a repeated addition. To keep the example simple, let's assume I am creating an approximation of the number 3, which I will use to multiply with the number 4. Therefore I want to calculate 3 * 4. Because multiplication is really just repeated addition, I can write it as 3 + 3 + 3 + 3 (3 * 4 means 3, four times). Now let's imagine that while creating an approximation of the number 3, I arrived at the number 3.1. As we've seen in the above paragraph, this has an Error of 0.1. Let's do the multiplication we mentioned with this approximation, but we'll do the additions one by one.

3.1                   = 3.1
3.1 + 3.1             = 6.2
3.1 + 3.1 + 3.1       = 9.3
3.1 + 3.1 + 3.1 + 3.1 = 12.4

Now keep in mind that we originally wanted to multiply by 3 not 3.1. As you can see the first line is off by 0.1, since the answer was supposed to be 3. The second answer is off by 0.2. The third is off by 0.3, etc. What I wanted to demonstrate with this is that the Error will accumulate the more "additions" you make (Again I'll repeat, multiplication is the same as repeated addition). In this example I chose these numbers as it is easier to demonstrate. Now let's get back to our original problem.

So now we've learned that the error accumulates. Previously we arrived at an approximation of 0.125, which has an Error of 0.025. What is the largest number of additions that we can do? The answer is 255, since that is the largest number we can work with. So we can multiply the Error by 255 to find the largest error that can arise in any of our calculations. The answer we get is 6.375. This is definitely not good for us. Since the total error is larger than our target value of 0.1, that would mean that if I were to add my approximation together 255 times, it would be as if I added 0.1 an extra 63 times! I only want to add it 255 times. So the maximum accumulated error must definitely be smaller than my target value of 0.1.

So now we know for sure that we need to be more precise. Let's see what other positions we need to check:

0.00011001100110011001100...
    ^   ^   ^   ^   ^

We need to check the marked positions, as they are positions that will allow us to round the value up. We've already worked through the first one and discovered that it's not accurate enough. I'll just work out the other positions until I find a suitable one.

BINARY           VALUE            ERROR            ACCUMULATED ERROR       GOOD ENOUGH?
0.001            0.125            0.025            6.375                   no
0.0001101        0.1015625        0.0015625        0.3984375               no
0.00011001101    0.10009765625    0.00009765625    0.02490234375           yes
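This search can also be automated. Here is a small Java sketch (class and method names are mine) that uses exact integer arithmetic: for each number of fractional bits k it rounds 1/10 up to k bits and applies the accumulated-error test above. It tries every k rather than only the 01 positions; at the other positions rounding up just reproduces an earlier value, so nothing is lost.

```java
public class FindApproximation {
    // Find the smallest number of fractional bits k such that rounding
    // 1/10 UP to k bits keeps the accumulated error over 255 additions
    // below 1/10. The approximation is num / 2^k, its error is
    // (10*num - 2^k) / (10 * 2^k), and the test 255 * error < 1/10
    // simplifies to 255 * (10*num - 2^k) < 2^k.
    static long[] search() {
        for (int k = 1; k <= 32; k++) {
            long pow = 1L << k;        // 2^k
            long num = pow / 10 + 1;   // round 2^k / 10 up (it is never exact)
            long err = 10 * num - pow; // the error, scaled by 10 * 2^k
            if (255 * err < pow) return new long[]{k, num};
        }
        throw new IllegalStateException("not found");
    }

    public static void main(String[] args) {
        long[] r = search();
        // prints 11 fractional bits, multiplier 205 (205 = 0b11001101)
        System.out.println(r[0] + " fractional bits, multiplier " + r[1]);
    }
}
```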

So now we've managed to find an approximation of 0.1, 0.00011001101, which is good enough to work for all values from 0 to 255, because if we add it 255 times (equivalent to multiplying it by 255), the total error is less than 0.1. This means that if I add my "0.1" 255 times, the integer part of the result will be exactly what it would have been had I added the real 0.1 those 255 times. Sure, the fractional part will be incorrect... but we're going to discard it anyway, so we don't care if it is incorrect.

Now what?

Now the fun stuff begins. I need to implement a division by 10 on my processor using this value I just calculated, and here I encounter yet another problem. The value I calculated has 11 bits (you don't count the 0 before the point) but my processor can only handle 8-bit values! Remember that there will be no decimal point in there (since we will be working in fixed point arithmetic). Let's temporarily represent that number as a 16-bit number instead: we get 00000000 11001101. The upper byte is all 0! And since we will treat it as a whole number instead of a fraction, leading 0s mean absolutely nothing, so we don't need to store them. We can store just the lower 8 bits, which will fit in our processor. However, it is very important to remember that there were 11 digits after the decimal point; the reason will become apparent shortly. So after all this, our approximation of 0.1 is finally something which we can plug into the processor: the byte 11001101.

Now let's work through an example, because things will be clearer that way. Suppose I have the number 42, and I want to divide it by 10. The number 42 in binary is 00101010. Now I have my 2 bytes that I can process using my processor.

42  = 00101010
0.1 = 11001101

If I multiply those two together I get the following 16-bit binary number

00100001 10100010

Now we just have to get our final answer out of that. But how will we do that? Earlier in this post I made an observation about multiplication in fixed point: the result will have as many fractional digits as the original numbers had, combined. So let's see how many fractional digits there are in our values in this example. The number 42 (00101010) has 0 fractional digits. We just converted 42 to 42. Simple so far. Our approximation of 0.1 (11001101) has 11 fractional digits (even though we don't see the decimal point, I told you to remember this fact about it). That means our result will have 0 + 11 = 11 fractional digits. Let's see the result once again, and this time I will put the decimal point where it should be

00100.001 10100010

Now since this is integer division, we should discard the fractional part, which would leave us with:

00100

If we convert that back to decimal we get 4, which is the correct answer for 42/10.

One more step...

There is one final issue left, which is very simple to solve. Basically we have to somehow tell the processor to discard the fractional part. This is actually very simple, and we will accomplish this by the use of bit shifts. A right bit shift is a very simple operation. It will simply move all bits one position to the right, and discard the rightmost bit (which has nowhere to go). We need to remove 11 digits from the right of our value (since the fractional part has 11 digits) so we shift right 11 times. Let's see this in action:

    00100001 10100010
1:  00010000 11010001
2:  00001000 01101000
3:  00000100 00110100
4:  00000010 00011010
5:  00000001 00001101
6:  00000000 10000110
7:  00000000 01000011
8:  00000000 00100001
9:  00000000 00010000
10: 00000000 00001000
11: 00000000 00000100

As you can see after 11 bit shifts to the right, we end up with the number 4, which is our final answer.

Now in our processor, 11 bit shifts will take 11 instructions. But there is a simple way to reduce them. As I said earlier, the processor can only work with 8 bit values. So when it works out the multiplication, it does not give us a 16-bit result. Instead it gives us two bytes which together form the 16 bit result. This is convenient for us, since we can save 8 shifts by simply ignoring the lower byte. Here's how it will work:

00100001 10100010 (complete result)
00100001          (we ignore the lower byte)
00010000          (shift right)
00001000          (shift right)
00000100          (shift right)

And as you can see we still ended up with the number 4 (in a convenient single byte) at the end.
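In higher-level terms the whole routine is (x * 205) >> 11, since 11001101 is 205 in decimal. A quick Java check (class and method names are mine) confirms that this matches integer division by 10 for every 8-bit input, using the same keep-the-high-byte-then-shift-three-times route as above:

```java
public class DivideBy10Check {
    static int divideBy10(int x) {    // x in 0..255
        int product = x * 205;        // the 16-bit result of MUL
        int highByte = product >> 8;  // ignoring the low byte = 8 shifts for free
        return highByte >> 3;         // three more shifts, 11 in total
    }

    public static void main(String[] args) {
        for (int x = 0; x <= 255; x++) {
            if (divideBy10(x) != x / 10) throw new AssertionError("fails at " + x);
        }
        System.out.println("42 / 10 = " + divideBy10(42)); // prints 42 / 10 = 4
    }
}
```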

The code

After all this trouble, here's the assembly code I wrote to do all this:

LDI R16, 42          ;Load 42 into register R16
LDI R17, 0b11001101  ;Load our approximation of 0.1 into R17
MUL R16, R17         ;Multiply R16 and R17.
                     ;This puts the low byte and high byte
                     ;of the result in R0 and R1 respectively
LSR R1               ;Three right shifts of the high byte
LSR R1
LSR R1               ;The final answer is in R1

Why all this trouble?

The reason we went to all this trouble is that division is slowwwwww. The manufacturer of this microcontroller has provided some generic multiply and divide routines that can be used with it. They can be found here. If you look at the 8x8 unsigned division algorithm in there, it states that the algorithm takes 58 clock cycles to complete. In my case I have sacrificed flexibility (my algorithm can only divide by 10!) for performance: the assembly code written above runs in 7 clock cycles, which is slightly over 8 times faster. There is another trick I could use to make it execute in 6 clock cycles instead of 7, but that is beyond the scope of this post.

I hope you found this post interesting, and I wish you luck if you ever decide to get into assembly programming. It's not easy, but it's actually quite fun once you start getting the hang of it.

Sep. 16th, 2012

Poker hand java comparator

I recently tried to solve Project Euler problem #54 which gives you a list of 1000 poker games and asks you how many games player 1 won.

I implemented a class Hand which describes a poker hand and implements the interface Comparable. This means that you can even have an array of poker hands and sort them from worst to best easily using Java's own Arrays.sort method.

The internal class "Card" used to represent a card also implements the Comparable interface but its compareTo method returns a reversed output (so the Arrays.sort method will sort cards in descending order).

The Hand class' constructor accepts 5 strings which describe each card. The string should be two characters long with the first character representing the card's rank and the 2nd the card's suit. The possible values for the rank are (in ascending order, ace being the highest ranked card): "23456789TJQKA", while the 2nd character should be one of "HDCS" (the first letter of the suit's name).

Once you construct a hand you can easily see which one wins by using the compareTo method. An example is below:

Hand h1 = new Hand("TS", "JS", "QS", "KS", "AS"); // example cards: a royal flush
Hand h2 = new Hand("2H", "3D", "5S", "9C", "KD"); // example cards: king high

int result = h1.compareTo(h2);
if (result == 1) {
	System.out.println("Player 1 has won");
} else if (result == -1) {
	System.out.println("Player 2 has won");
} else {
	System.out.println("It's a tie");
}

The source code for the class can be found here.

Apr. 7th, 2012

Cheating at Draw Something

I have managed to discover a way to cheat at the popular game Draw Something.

The way the game works is that it tells you the number of letters in the word. For this example let's assume that the word contains 5 letters.

It also gives you a set of 12 letters that you can use to guess the word. 5 of those letters you will need, and the rest are given to you at random. My method takes advantage of that fact.
Since 5 of those letters will always be there I can reduce the game to a simple anagram.

The first set of letters I got was: AHKKLMNOOPRW
I quit the game and restarted it, and it gave me a different set of letters: AHHKKNOPQRRW

I checked what was common between them and the result was: AHKKNOPRW
Since the result is more than 5 letters, I can simply get another set and check to see what's common between those to eliminate more letters.

The next set was: ABHHKNOPRSZY
and the next result was AHKNOPR

The next set was: AEKLNOPPRSUY
and the result was: AKNOPR

The next set I got after restarting the game was: ABDFGMNOPPRW
and the following letters were common between this set and the previous result: ANOPR

I now have the 5 letters which were common to all 5 sets of letters given to me by the game. Naturally I wrote a program to automate the above process.
Now since the word I want is 5 letters long, and I have 5 letters, the game is now reduced to a simple anagram.
Next I wrote a program that opens a dictionary (I used openoffice.org's dictionary for this) and looks for all 5-letter words containing only these letters and prints them out for me.
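Here is a minimal Java sketch of both steps, the letter-set intersection and the anagram check (class and method names are mine, and a real version would read its candidate words from a dictionary file rather than test them one by one):

```java
import java.util.Arrays;

public class DrawSomethingHelper {
    // Multiset intersection of two strings of uppercase letters.
    static String common(String a, String b) {
        int[] counts = new int[26];
        for (char c : a.toCharArray()) counts[c - 'A']++;
        StringBuilder sb = new StringBuilder();
        for (char c : b.toCharArray()) {
            if (counts[c - 'A'] > 0) { counts[c - 'A']--; sb.append(c); }
        }
        char[] out = sb.toString().toCharArray();
        Arrays.sort(out);                   // sort for a stable, readable result
        return new String(out);
    }

    // Does the word use exactly the available letters?
    static boolean isAnagram(String word, String letters) {
        char[] w = word.toCharArray(), l = letters.toCharArray();
        Arrays.sort(w); Arrays.sort(l);
        return Arrays.equals(w, l);
    }

    public static void main(String[] args) {
        String result = common("AHKKLMNOOPRW", "AHHKKNOPQRRW");
        for (String next : new String[]{"ABHHKNOPRSZY", "AEKLNOPPRSUY", "ABDFGMNOPPRW"}) {
            result = common(result, next);
        }
        System.out.println(result);                     // prints ANOPR
        System.out.println(isAnagram("APRON", result)); // prints true
        System.out.println(isAnagram("PIANO", result)); // prints false
    }
}
```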

The program's output was 2 words: PARON and APRON

I now managed to reduce the search space from 95040 possibilities (all possible permutations of 5 items from 12) to 2 possibilities. I tried APRON first and that was it. Yay!

Jul. 31st, 2011

HTML Minesweeper

I recently wanted to write my own version of minesweeper... so I did.

All it took was some HTML, CSS and JavaScript...

You can find the code here

Mar. 10th, 2011

A useful piece of code

Here's a handy little class I wrote which is useful when you need a huge boolean array in Java (I sometimes needed one that wouldn't have fit in my computer's RAM as a boolean[] array). In this implementation a boolean value takes up only 1 bit of space, as opposed to the full byte a regular boolean variable normally occupies. Internally a multiple of 64 bits is allocated, so if the number passed to the constructor is not a multiple of 64, the total capacity will actually be the next multiple of 64 up. It does not check for bounds or anything, but it gets the job done well. The intended capacity should be passed to the constructor. The set method sets the desired bit to 1, the clear method sets the desired bit to 0, the toggle method flips the desired bit's value, and the get method reads the desired bit's value. Here it is:

public final class BitArray {
  private final long[] values;

  public BitArray(long capacity) {
    this.values = new long[((int)(capacity >> 6)) + ((capacity & 0x3F) != 0 ? 1 : 0)];
  }
  final void set(long pos)    { values[(int)(pos >> 6)] |=  (1L << (pos & 0x3F)); }
  final void clear(long pos)  { values[(int)(pos >> 6)] &= ~(1L << (pos & 0x3F)); }
  final void toggle(long pos) { values[(int)(pos >> 6)] ^=  (1L << (pos & 0x3F)); }
  final boolean get(long pos) {
    return (values[(int)(pos >> 6)] & (1L << (pos & 0x3F))) != 0;
  }
}

Perfect Shuffle

I recently found the following problem online:

Given a deck of nCards unique cards, cut the deck iCut cards from top and perform a perfect shuffle. A perfect shuffle begins by putting down the bottom card from the top portion of the deck followed by the bottom card from the bottom portion of the deck followed by the next card from the top portion, etc., alternating cards until one portion is used up. The remaining cards go on top. The problem is to find the number of perfect shuffles required to return the deck to its original order.

It then asked what the result would be if nCards=1002 and iCut=101

I thought it was quite an interesting problem and decided to try and solve it myself.

The simplest and most obvious solution is to use a brute force method by repeatedly shuffling the deck of cards until it reverts back to its original order. The deck was represented by a list of numbers which were sorted in ascending order for the initial state. The list was then shuffled repeatedly using the method described above and a counter was maintained until the list became sorted correctly again (which means that it returned to its initial state).
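A brute-force version might look like this in Java, under my reading of the shuffle description above (class and method names are mine):

```java
import java.util.Arrays;

public class ShuffleBruteForce {
    // One perfect shuffle: cut iCut cards from the top, then lay cards down
    // alternately (starting with the bottom card of the top portion) until
    // one portion runs out; the rest of the other portion goes on top.
    static int[] shuffle(int[] deck, int iCut) {
        int n = deck.length;
        int[] pile = new int[n];  // built from the bottom up
        int idx = 0;
        int t = iCut - 1;         // bottom card of the top portion
        int b = n - 1;            // bottom card of the bottom portion
        boolean fromTop = true;
        while (t >= 0 && b >= iCut) {
            pile[idx++] = fromTop ? deck[t--] : deck[b--];
            fromTop = !fromTop;
        }
        while (b >= iCut) pile[idx++] = deck[b--]; // leftover bottom portion on top
        while (t >= 0)    pile[idx++] = deck[t--]; // or leftover top portion on top
        int[] result = new int[n];                 // flip pile to top-to-bottom order
        for (int i = 0; i < n; i++) result[i] = pile[n - 1 - i];
        return result;
    }

    static int countShuffles(int nCards, int iCut) {
        int[] initial = new int[nCards];
        for (int i = 0; i < nCards; i++) initial[i] = i;
        int[] deck = initial.clone();
        int count = 0;
        do {
            deck = shuffle(deck, iCut);
            count++;
        } while (!Arrays.equals(deck, initial));
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countShuffles(7, 2)); // prints 12
    }
}
```

For nCards = 7 and iCut = 2 it reports 12 shuffles.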

The algorithm worked and did return a correct value for small lists. However as the list grew larger the possibilities grew much larger as well, making this algorithm unfeasibly slow. A new method of determining this value was thus required.

I decided that I should separately track the movements of individual cards through the deck as it was being shuffled. Since a deck will always eventually return to its initial state, it stands to reason that a single card will eventually return to its initial position.

The first thing I did was perform the brute force algorithm on a small list by hand on a piece of paper. I chose nCards = 7 and iCut = 2, since the first program came up with a value of 12 shuffles required, which was small enough to do on paper. I then used this piece of paper to analyze the way things were shuffled, so that I could come up with some way of determining the new position of a single card given the old position, based on nCards and iCut. Here is the list I came up with (the first line is the initial state, and the rest are the state of the deck after each shuffle):

0 1 2 3 4 5 6
2 3 4 5 0 6 1
4 5 0 6 2 1 3
0 6 2 1 4 3 5
2 1 4 3 0 5 6
4 3 0 5 2 6 1
0 5 2 6 4 1 3
2 6 4 1 0 3 5
4 1 0 3 2 5 6
0 3 2 5 4 6 1
2 5 4 6 0 1 3
4 6 0 1 2 3 5
0 1 2 3 4 5 6

For the sake of conciseness I will use the following variable names in the explanation below:

  • nCards = the size of the whole deck
  • iCut = the position at which the deck will be cut. Incidentally, this also happens to be the number of cards in the top stack after cutting.
  • ST = the number of cards in the top stack. This has the same value as iCut.
  • SB = the number of cards in the bottom stack. This can be calculated as SB = nCards - iCut.
  • DIFF = the difference in the number of cards between the top and bottom stacks. This can be calculated as DIFF = abs(ST - SB).
  • POS = the position of a particular card. A position is always relative to the whole deck, and is 0-based, meaning that 0 is the position of the first card in the deck.

I will refer to the entire deck as "deck", and I will call the separate decks created after splitting the deck "stacks", to avoid confusion.

The positions are zero-based meaning that they will range from 0 to nCards-1.

The top stack will be from 0 to iCut-1, while the bottom stack will be from iCut to nCards-1.

An observation I made is that the first DIFF cards of the larger of the two stacks will simply be placed on top of the shuffled deck. This means that if the top stack is the larger one, the position remains unchanged; if the bottom stack is the larger one, the card keeps the same index it had within the bottom stack, except that it becomes its index within the whole deck. Example: if the bottom stack is larger than the top stack by 4 cards, then its first 4 cards, indexed 0 to 3, will simply be moved to the top. This means, for example, that the second card in the bottom stack will become the second card in the whole deck. Therefore in this case we can simply subtract ST from the position (making it as if the top stack wasn't there) to convert it correctly.

These were the two simplest cases, where the cards weren't actually shuffled, but either remained stationary or were copied, in order, to the top of the deck. The other cases were slightly trickier. However, I did manage to find a way to map each position to a new one.

I decided to find a formula that would relate the distance of the card from the bottom of the stack it is currently in to its distance from the end of the whole deck. The bottom card of a stack will have a distance of 0 (since it is 0 cards away from the bottom, as it IS the bottom card), the second to last will have a distance of 1, etc. Since I will be taking a card from the top and bottom stacks alternately, I figured that there had to be a multiplication by two in there somewhere.

For the top stack this happened to be exactly the case: the new distance is the distance from the bottom of the stack multiplied by two. The distance from the bottom of the stack was found to be ST - 1 - POS, and multiplying it by two gives (ST-1-POS)*2, the distance from the end of the deck. To correctly map this distance back to an actual position, we now have to subtract it from the position of the last card in the deck, which is nCards-1. Therefore the final formula in this case is nCards-1-((ST-1-POS)*2)

The formula for relating the distance from the end of the bottom stack to the distance from the end of the deck is quite similar. First we have the extra step of converting the position from relative to the whole deck, to relative to the bottom stack (This step was unnecessary for the top stack as both the top stack and the deck start at position 0). The bottom stack starts at position iCut and since we need a zero-based position we simply have to subtract iCut from it. So we substitute the new corrected position into the formula, as well as change ST to SB (since we now need the size of the bottom stack, since that's the one we're working on). Also, since the method of shuffling describes that the first card on the bottom of the deck will come from the top stack, the distance then needs to be subtracted from the position of the second to last card of the deck, which is nCards-2. Therefore the final formula is nCards-2-((SB-1-(POS-iCut))*2), and to eliminate a pesky set of brackets I can write this as nCards-2-((SB-1-POS+iCut)*2)

Using the above I now have a way of finding out how many shuffles are needed to return a single card back to its initial position, by updating the position repeatedly according to the above rules, until it becomes the same one that it started as, and counting the number of iterations that this process took. This is faster than shuffling the entire deck and is much shorter since a single card may return to its initial position multiple times during the entire process, and this way we only need to analyze this once. This means that we are finding the length of a cycle. For example in the shuffle I did on paper shown above, it can be seen that 0 starts at position 0, then moves to position 4, then to position 2 and then back to 0. Therefore the number 0 will always be back in position 0 every 3 shuffles.
Similarly, 1 moves along this cycle: 1 -> 6 -> 5 -> 3 -> 1. Since it is a cycle, starting from any of the cycle's visited positions gives the same cycle length, as it follows the same cycle. We only need to find each cycle once, so to further reduce the number of checks we need to perform, I created an array of boolean values marking which positions have already been visited during the search for a cycle. If a position has been visited, there is no need to start a search from it again. As you'll see in the next paragraph, this won't affect the output, as we only need the set of distinct cycle lengths to correctly compute the total number of shuffles needed. We can also simplify the counting a bit by updating the position until we reach one that has already been visited.

Now we need a way of finding out how many shuffles it will take for the entire deck to return to its initial state. This means that we need enough shuffles for all the cards to reach their initial position simultaneously. This can simply be calculated by finding the Lowest Common Multiple of all the cycle lengths. This will give us the correct value. Here's why it works:

In the example above there are two cycles that can be found, one with length 3 and the other with length 4. 3 and 4's lowest common multiple is 12. And this is in fact the number of shuffles needed to return the deck to its initial state.

There are two cycles in it:
A: 1 -> 6 -> 5 -> 3 -> 1
B: 0 -> 4 -> 2 -> 0

This means that 0, 4 and 2 will return to their starting positions once every 3 shuffles, while 1, 6, 5 and 3 will return to their initial positions once every 4 shuffles. For the deck to return to its initial state, all cycles have to be completed (so the cards return to where they began), which is why the lowest common multiple is the useful number here. In this case the LCM is 12, and it works: cycle A will be completed exactly 3 times after 12 shuffles, while cycle B will be completed exactly 4 times. Since all cycles complete exactly within 12 shuffles, we can say that all cards have returned to their original positions, and therefore the deck is back to its initial state.
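Putting the position formulas and the cycle-length idea together, here is a Java sketch (class and method names are mine; it only handles the case where the bottom stack is the larger one, which is true both for the 7-card example and for nCards = 1002, iCut = 101):

```java
public class PerfectShuffle {
    // New position of the card at position pos after one shuffle,
    // assuming the bottom stack is the larger one (nCards - iCut > iCut).
    static int newPosition(int pos, int nCards, int iCut) {
        int st = iCut, sb = nCards - iCut, diff = sb - st;
        if (pos < iCut)        return nCards - 1 - (st - 1 - pos) * 2;       // top stack
        if (pos - iCut < diff) return pos - st;                              // first DIFF cards of the bottom stack
        return nCards - 2 - (sb - 1 - (pos - iCut)) * 2;                     // rest of the bottom stack
    }

    static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }

    static long countShuffles(int nCards, int iCut) {
        boolean[] visited = new boolean[nCards];
        long lcm = 1;
        for (int start = 0; start < nCards; start++) {
            if (visited[start]) continue;          // this position's cycle is already known
            long length = 0;
            int pos = start;
            do {
                visited[pos] = true;
                pos = newPosition(pos, nCards, iCut);
                length++;
            } while (pos != start);
            lcm = lcm / gcd(lcm, length) * length; // LCM of all cycle lengths so far
        }
        return lcm;
    }

    public static void main(String[] args) {
        System.out.println(countShuffles(7, 2));   // prints 12
        System.out.println(countShuffles(1002, 101));
    }
}
```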

I coded the program using Processing and fed it nCards=1002 and iCut=101 to find the number of shuffles required by the question and got my answer as 5812104600.

My code can be found here

Oct. 19th, 2010

Pi by random numbers

An interesting (albeit very inefficient) way to estimate pi is to use random numbers.

Imagine a square with a side of length 2, positioned such that its diagonals intersect at coordinate (0,0) (which is just a fancy way of saying that its center lies exactly at (0,0))
Now imagine a circle with its center at (0,0) and having a radius of 1. Now we have a circle that touches all 4 sides of the square exactly. This is the largest circle that can fit inside the square we have.

Now we know that the ratio of the area of a circle to the area of the square circumscribing it is pi/4... just look it up... Google is your friend.

Now imagine the square with the circle inside it on a wall and you blindly throwing some darts at it. Some of them will lie on the circle, while some of them will not. Assuming that each point on the square has an equal probability of being hit, the ratio of those that hit the circle to all the darts thrown (assuming that all of them hit somewhere in the square) is also approximately pi/4. Therefore if we take the ratio and multiply it by 4, we get pi.

Now how will we know if the dart has hit the circle or not? This is quite simple. If the circle is centered at (0,0) and has a radius of 1, that means that any point that is more than 1 unit away from (0,0) is not in the circle. We can find the distance to the origin by imagining a right-angled triangle with its hypotenuse lying between where the dart hit and the origin. The other 2 sides of it will then have lengths x and y (x and y being the position of the dart). Now if we use Pythagoras' theorem (a^2 + b^2 = c^2) we can calculate the length of the hypotenuse of the triangle, and thus find the distance between the dart and the origin (because the hypotenuse of our triangle is a straight line between the dart and the origin), by simply calculating sqrt(x^2+y^2). If the resulting value is smaller than 1, that means that the dart is inside our circle.

With this in mind we can write a program that generates a certain amount of pairs of random values from -1 to 1 (we need a pair because we need an x and a y coordinate for each dart) and keep track of all the ones that hit inside the circle. After we do this we simply take the number of darts that hit the circle, divide it by the total number of darts thrown, and finally multiply the result by 4 to get an approximation of pi.

The program to do this is very short so here it is:
double darts = 1000000;
double insideCircle = 0;
for (int i = 0; i < darts; i++) {
    double x = Math.random()*2-1;
    double y = Math.random()*2-1;
    double distance = Math.sqrt((x*x)+(y*y));
    if (distance < 1) {
        insideCircle++;
    }
}
System.out.println(insideCircle/darts*4);

When I run this program here, it spits out 3.141452.
It's an inefficient way to estimate pi; after all, it took a million iterations to estimate it correctly to just 3 decimal places (pi is 3.14159265..., so you can compare the output with it). But it is an interesting way nonetheless, because it uses random numbers to calculate it.

Oct. 15th, 2010

Finding Primes

So I was bored at home one day and decided to try to find something to do. Out of nowhere the phrase "prime numbers" popped up in my head, and I decided to try to find as many as I could.

To do this I implemented a simple Sieve of Eratosthenes, which is an ancient algorithm for finding primes up to a finite upper limit. The algorithm is as follows:

  1. Create a list of integers from 2 to n

  2. Let p be 2, the first prime number

  3. Mark all multiples of that number as not prime

  4. Find the next number in the list that is still marked as prime, and set p to be this number

  5. Repeat steps 3 and 4 until the square of p is larger than n

  6. The numbers remaining on the list marked as prime are all the prime numbers smaller or equal to n

Since all I need to know about each number is whether it is prime or not, I decided not to store the numbers themselves in the list, but a boolean value indicating primality. Since in Java an array of boolean values is initialized with every element set to false, I decided that false would mean the number is prime and true that it is composite. Having decided this I set to work and implemented the first version. It had an array of boolean values and implemented the algorithm as described above. It worked perfectly and successfully found the primes.

However, I was not happy with it yet... I wanted to make it faster. Every composite number has at least one prime factor that is smaller than or equal to its square root. This means that, for example, having processed values of p up to 5, all the composite numbers smaller than 6^2 have already been marked, so they don't need to be checked again. Another way to think about this: since every composite number has a prime factor smaller than or equal to its square root, all values under 6^2 have a square root less than 6, so by checking values of p up to 5 we have already proven each of them to be either prime or composite. Knowing this, I sped things up a little in my program. When marking the multiples of p, the marking can start at p^2, since all the smaller multiples have already been marked. This means fewer marks have to be made for each value of p; e.g. in the first version, when processing the number 5, the program would mark 10, 15, 20, 25, etc. In the new version, the marking starts at 25 (5^2) because all the numbers smaller than that have already been handled, so 10, 15 and 20 need not be marked again. Sure, three numbers may not seem like a lot for now, but the amount of numbers skipped grows as p grows larger. After comparing the outputs from both versions, they were identical, so the program still found the correct primes as it was supposed to.
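To make the description above concrete, here is a minimal standalone Java sketch of the sieve with the p^2 starting-point optimization (not the original code, just an illustration; as in the post, false means "still prime"):

```java
public class Sieve {
    // returns an array where composite[i] == false means i is prime (for i >= 2)
    static boolean[] sieve(int n) {
        boolean[] composite = new boolean[n + 1];
        for (int p = 2; (long) p * p <= n; p++) {
            if (composite[p]) continue;           // p is still marked prime
            for (int m = p * p; m <= n; m += p)   // start marking at p^2
                composite[m] = true;
        }
        return composite;
    }

    public static void main(String[] args) {
        boolean[] c = sieve(100);
        int count = 0;
        for (int i = 2; i <= 100; i++) if (!c[i]) count++;
        System.out.println(count);   // 25 primes up to 100
    }
}
```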

This approach has a small problem though: it needs to keep track of all the numbers from 2 up to n, which means there is a limit on how many numbers you can check, depending on how much memory your machine has. I wanted to increase that limit. Since memory in a PC is byte-granular (meaning the smallest piece you can address is a byte), a boolean value in Java still takes a whole byte of memory. A boolean can only be true or false, so only 1 bit is needed to represent it; for every boolean value I store, I'm therefore wasting 7 bits of memory. So I wanted to create a new data structure that uses only 1 bit of memory per boolean value. I decided to store the values as an array of long values, each consisting of 64 bits, and use the individual bits of each long to store 64 boolean values. It was quite simple to implement. Given a position at which to store a bit, you divide it by 64 to get the index of the long value in the array into which it should go, and use the remainder to determine which bit in that value will store it. A long value is created that has a 1 in that bit position, and it is ORed with the stored long value to turn the bit on. Reading follows the same procedure, except an AND is performed at the end instead of an OR; if the resulting value is not 0, the bit was on.

When I tried it out it still found the primes correctly, but it ran quite a bit slower. On further inspection it appeared that the program slowed down because of the code required to find the position of the bit to store the value in. It used a division and a modulo operation to calculate the required positions. Since those are both expensive operations, and that code was being run constantly, it slowed things down quite significantly. This was fixed by using bitwise operations. Since the modulo of 64 can be from 0 to 63, it will always fit in the rightmost six bits of the position (as the 7th bit means 64, which is larger than 63). Also, since 64 is a power of 2, the division can be performed using a bit-shift operation, which is much faster: a shift of 6 positions to the right divides by 64 (shifting one position to the right divides by 2, so a shift of 6 positions divides by 2^6 = 64). The modulo of 64 can be computed by ANDing the position with 0x3F (binary 00111111), which clears all bits of the position except the rightmost 6, leaving us with the position modulo 64. Therefore if the position is n, we need to access the (n>>6)th element of the long array, and OR it with (1<<(n&0x3F)) to turn that bit on (or AND with it to check whether the bit is on). When the program was run again, it still found the prime numbers correctly and ran much faster. To gain a further tiny bit of speed, I wrote the position-calculating code inline with the program to save the cost of a function call (I had implemented the bit array in a class before). This doesn't save much time, but it helps shave off a few seconds when checking a very large range of numbers.
It was still very slightly slower than the first version of the program that I wrote, but it's a compromise I was willing to make in order to gain the much higher space efficiency.
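Put together, the bit array described above looks something like this (a simplified standalone sketch; the names are mine, and the real version was later inlined into the sieve for speed):

```java
public class BitArray {
    private final long[] bits;

    BitArray(long size) {
        // one long holds 64 flags, so we need size/64 longs (rounded up)
        bits = new long[(int) ((size >> 6) + 1)];
    }

    void set(long n) {
        // n >> 6 is n / 64 (which long), n & 0x3F is n % 64 (which bit)
        bits[(int) (n >> 6)] |= 1L << (n & 0x3F);   // OR the bit on
    }

    boolean get(long n) {
        return (bits[(int) (n >> 6)] & (1L << (n & 0x3F))) != 0;   // AND to test
    }

    public static void main(String[] args) {
        BitArray b = new BitArray(1000);
        b.set(63); b.set(64); b.set(999);
        System.out.println(b.get(63) + " " + b.get(65));   // true false
    }
}
```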

On the first version I tested numbers up to around 2 billion (using slightly more than 2GB of RAM). With the latest version I could test 8 times as many numbers (since a byte has 8 bits, the original version wasted 7 of them per value, while the final version uses all of them). That means I could now test all the numbers up to around 16 billion. Could I fit any more in there? Sure!

Up till now I was storing a bit for each number from 2 to 16 billion. However, since all even numbers (except 2) are not prime, I was still wasting half the memory storing bits for even numbers that can never be prime. I modified the program to only store bits representing odd values, thereby doubling the amount of numbers I can test yet again, to around 32 billion. First of all, in the algorithm I had p start at 3 instead of 2, since all even numbers were already eliminated from the list. Then, when searching for the next value of p, p was incremented by 2 instead of 1 each time (ensuring that p always remains odd, since odd + even = odd). The square of an odd number is always odd as well, so there was no problem there (remember that when we mark out multiples, we always start at the square of p... which must now be odd, since we can't address even numbers). When marking out multiples, the program started at p^2 and then repeatedly added p to get to the next multiple. But since p is odd, and adding an odd number to an odd number gives an even number, there was a bit of a problem: we can no longer address even numbers. However, while odd + odd = even (our problem), even + odd = odd. So by adding 2p each time we always land on an odd number (starting from an odd number: odd + odd = even, even + odd = odd), skipping every other multiple of p, which is even anyway. This also halves the number of markings we need to make, making the program faster.

Now that we've ensured we will only ever touch odd numbers, the program can store only the odd numbers and not the even ones. Previously, the number n was represented by the bit at position n; now, we take the number, subtract one and halve it to get the position of its bit in the bit array. We subtract one to make the number even (since halving an odd number does not give an integer) and divide by 2 because we only store half of the numbers, having discarded the evens. So the number 3 is stored in bit 1, 5 in bit 2, 7 in bit 3, etc. Bit 0 is never accessed by the program, so we're "wasting" it. This could be remedied by subtracting 1 from each position after it is calculated, but I decided against it, to avoid an extra subtraction every time a position is computed, which happens very frequently in the program (it's a good tradeoff... 1 bit for slightly more speed... and besides, bit 0 can be taken to mean the number 2, which is the first prime number). This whole subtract-one-and-halve operation can be accomplished with a single right shift of the number by 1. All odd numbers have the rightmost bit set to 1 when represented in binary, and subtracting 1 simply clears it. As we said earlier, dividing by 2 is the same as shifting one position to the right. In integer math, shifting to the right discards any bits that fall off the end; here that is just the rightmost bit, which is exactly the one we would have cleared by subtracting one first.
So taking an odd number n and calculating n>>1 is the same as first subtracting one and then halving it, accomplished in a single operation that is much faster than a subtraction followed by a division by 2.
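A quick check of the mapping (the helper names here are hypothetical, just for illustration):

```java
public class OddMapping {
    static long toPosition(long n)  { return n >> 1; }      // (n - 1) / 2 for odd n
    static long toNumber(long pos)  { return pos * 2 + 1; } // the inverse mapping

    public static void main(String[] args) {
        System.out.println(toPosition(7));  // bit 3 holds the number 7
        System.out.println(toNumber(3));    // and bit 3 maps back to 7
    }
}
```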

Running this program, I checked that it still worked correctly, which it did. After the sieve algorithm finishes, it is simply a matter of scanning the bits in the array: every bit still set to 0 represents a prime number. So if position 3 is set to 0, that means the number 3*2+1 is prime (we multiply by 2 and add 1 to get the original number back, because we subtracted one and halved to go from number to position, so this is the inverse operation). And indeed, position 3 is still set to 0, meaning the number 7 is prime.

Also, since we will find a large number of primes, it does not make sense to simply show them on screen (they'd whizz by too fast to see anyway), so I also wrote some code that saves them into files in chunks of 1 million primes (file 0 contains the first million primes, file 1 the second million, etc.). I ran the program to find all the primes smaller than 34358400000, which is as much as I could fit in the RAM on my laptop. It found about 1.48 billion primes in around 10 minutes (the program ran for about 35 minutes, but about 25 of those were spent saving them to disk). They were stored in text files with 1 prime number per line... they took up 16GB of hard disk space :D

Will I use them for anything? Probably not... but it was still a good exercise in programming :D

The program was implemented in Processing. The source code for the above program can be found here.

Jul. 19th, 2010

Maze Generator

I recently took an interest in maze generation and wanted to give it a try. I read about 3 methods for generating a perfect maze (perfect meaning that there are no loops and there is only one unique solution between any two points in the maze) which are a Depth First Search, Kruskal's algorithm and Prim's algorithm.

A perfect maze is basically a tree structure, and in fact all three of those algorithms generate a spanning tree on a graph. In this post I will be talking about my implementation of Kruskal's algorithm (I actually implemented all three, but I found Kruskal's to be the most interesting one).

As I said, Kruskal's algorithm creates a spanning tree over a set of points, and that's exactly how we'll be treating our maze. The maze will consist of a rectangular array of cells, with each cell having a wall on each of its four sides. Each cell will be a point in the "graph", and our program will knock down walls in such a way that a unique path exists between any two cells in the grid.

The algorithm makes use of a disjoint-set data structure, so that each cell can belong to a set of cells. In the beginning each cell is in its own set (ie. a set with only 1 element), and our program is initialized that way (after setting up the cells and putting up all the walls). A list of all the walls is also kept.

The algorithm then picks a random wall from the list and removes it from the list. It then checks whether the two cells the wall separates are in the same set. If they are not, the wall is "broken down", connecting the two cells, and the two sets to which they belonged are joined into one. If the two cells already belong to the same set, nothing is done and the wall is simply removed from the list (while leaving it "built"). This process is repeated until all the cells are contained in one big set.

The best way to explain why this works is by an example. Suppose we want to generate a 2x2 maze (not really a difficult one but it's enough to demonstrate this concept). We have 4 cells called A, B, C and D (A and B are at the top, and C and D are at the bottom) and there are 4 walls between them. Since each wall is between two cells we'll refer to the walls by two letters denoting the cells they connect (ex. wall AB lies between cells A and B).

The maze is initialized and all the cells are in their own set. A list of all the walls is also kept. Suppose the first wall chosen at random is wall AB. Cells A and B are not in the same set, so AB is broken down, connecting cells A and B, and the two sets are joined (we now have 3 sets instead of 4: one containing cells A and B, one containing just cell C, and another with just cell D). Joining the two sets means there is now a path between all the cells in both sets.

The next random wall to be checked now happens to be CD. Cells C and D are not in the same set so the wall is broken down and both sets joined (so now we have two sets, one containing A and B, while the other contains C and D). Up till now we can travel between C and D, and between A and B, but we cannot reach the whole maze from any location yet.

Suppose the next wall chosen was BD. They're not in the same set so, again, the wall is broken down and the two sets joined. Now all the cells are contained in one set so the algorithm stops.

However, for the purpose of this example let's say we now try to remove AC. When we check the cells we see that they are already in the same set, so the wall is not broken down: we can already travel between them, even though the wall between them is still standing. If we look at our maze right now we can see that this is true. We can travel from A to C by going through the other two cells (which are connected), following the path A --> B --> D --> C.

While we obviously wouldn't want to generate a 2x2 maze in practice, the concept works on any number of cells and the end result is always a perfect maze, with no loops and only one unique path between any two points in the entire maze.

I wanted to try this out so I wrote a program to do this. I implemented it in Processing which is based on Java. I created a class called Cell which stores information about each cell such as which walls surround it, whether it is a start or an end cell (I then wrote a method which finds a path between the start and the end points... but I'll talk about that later) and whether it is a part of the solution (ie. it forms part of the path between the start and the end points. Again, I'll talk about this later.) The class also has a variable in which to store a reference to which set it belongs to (which I called EquivalenceClass in the program) as well as a reference to the next cell in the set (I will explain why I needed these in a bit)

Afterwards I created a Wall class which stored a boolean variable indicating whether it was still standing or not, and a reference to the two cells that it separates.

The maze was represented by a 2D array of Cell. I initially stored the list of Walls in an ArrayList. When the cell array was initialized, Cell objects were created and placed in the array and the walls created and referenced by the Cells (and the cells referenced in the Wall objects), while the Wall objects were also added to the ArrayList.

The first performance improvement I thought of was getting rid of the ArrayList. During the main loop of the generation algorithm, I was repeatedly removing a wall from a random position in the ArrayList. This proved to be a big hit to performance, because when an item is removed from an ArrayList, each subsequent item has to be shifted one position up to fill the gap. I fixed this by storing all the Walls in an ordinary fixed-size array and manually keeping an index starting from 0. The array of Walls is shuffled after it is populated, and the walls are then accessed sequentially (incrementing the index variable every time one is used). Since the array was shuffled beforehand, sequential access yields the walls one by one in random order, and in this way the use of the ArrayList was eliminated.

The sets were implemented as a linked list. I initially used Java's built-in LinkedList class, but later decided to implement my own to further improve performance. When adding an object to a LinkedList (the one provided with Java), it has to be wrapped in a node object, which means more objects have to be created. I avoided this by putting the necessary fields directly in the Cell class (a reference to the next cell in the list, or "set") and implementing a custom linked list. The cells also contain a reference to the EquivalenceClass they belong to, which serves our purpose of implementing the disjoint-set data structure we need. This was all implemented in a class called EquivalenceClass, which stores a reference to the first and last cell of the linked list it represents.

The operations needed for a disjoint-set data structure are now in place. To create a set, a new EquivalenceClass is created and a Cell is passed to it (it automatically sets the Cell to be its first and last element [since it is its only element for now] and sets the Cell's equivalenceClass field to point to it). For the find operation (discovering which equivalence class a cell belongs to), we have the equivalenceClass field in the Cell itself, and checking whether two cells are in the same equivalence class is as simple as comparing their equivalenceClass fields (the == operator in Java compares the references to the objects, not the objects themselves... so if two cells point to the same EquivalenceClass object, an == comparison will say so). Joining two sets takes slightly longer, because each cell in one of the sets has to be updated to point to the other EquivalenceClass object (we don't create a new one, we just add all the elements of one to the other). After the references are updated, the two linked lists are spliced together by linking the end of one list to the first cell of the other and updating the first and last references of the surviving EquivalenceClass. Because each Cell's reference has to be updated manually, the smaller list is always appended to the larger one (so the fewest items need updating), which means the number of items in each set is also stored in the EquivalenceClass object. The number of sets is initially the same as the number of cells, and this is stored in a variable which is decremented whenever two sets are joined.
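The set operations described above can be sketched roughly like this (the class and field names follow the post, but the details are my reconstruction, not the exact original code):

```java
class Cell {
    Cell next;                        // next cell in the same set
    EquivalenceClass equivalenceClass;
}

class EquivalenceClass {
    Cell first, last;
    int size;

    EquivalenceClass(Cell c) {        // a new set containing just one cell
        first = last = c;
        size = 1;
        c.equivalenceClass = this;
    }

    static boolean sameSet(Cell a, Cell b) {
        // reference comparison: true only if both point at the same object
        return a.equivalenceClass == b.equivalenceClass;
    }

    // join two sets, always appending the smaller list to the larger one
    // so the fewest cells need their reference updated
    static EquivalenceClass union(EquivalenceClass a, EquivalenceClass b) {
        if (a.size < b.size) { EquivalenceClass t = a; a = b; b = t; }
        for (Cell c = b.first; c != null; c = c.next)
            c.equivalenceClass = a;   // repoint every cell of the smaller set
        a.last.next = b.first;        // splice the two linked lists together
        a.last = b.last;
        a.size += b.size;
        return a;
    }
}

public class SetDemo {
    public static void main(String[] args) {
        Cell x = new Cell(), y = new Cell(), z = new Cell();
        new EquivalenceClass(x); new EquivalenceClass(y); new EquivalenceClass(z);
        EquivalenceClass.union(x.equivalenceClass, y.equivalenceClass);
        System.out.println(EquivalenceClass.sameSet(x, y)); // true
        System.out.println(EquivalenceClass.sameSet(x, z)); // false
    }
}
```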

Using this structure we can generate the maze by picking the next element from the list of walls, checking whether the two cells it connects are in the same set and, if not, setting its standing variable to false and joining their equivalence classes, repeating the process until the number of sets reaches 1.

A solver was also implemented that finds a path between the start and end points (by default always put in the upper left and bottom right corners). Initially all the cells are considered (potentially) a part of the solution. It works by finding a dead end (a cell with only one exit pointing to a part of the solution) and marking it as not part of the solution and follows the path until it reaches a junction. If a junction has had all but one of its exits marked as not part of the solution, it is itself considered to be a dead end. The start and end points are never considered to be dead ends. This process is repeated on every cell in the maze and ultimately, only the path between the end points remains highlighted.

Here is an applet I made that shows all three algorithms mentioned at the beginning of this post at work. The one on the left is a depth first search, the middle one is Prim's Algorithm and the one on the right is Kruskal's algorithm that was described here (although the code is an old version and quite slow... but enough to see an animation of how it works).

Here is the code for the final version that I described in this post.

Jul. 13th, 2010

Text to Brainfuck converter

Writing code to display text in Brainfuck is a bit of a pain (as is, well... everything else in the language), so I decided to write a program that will convert a string into the brainfuck code that reproduces it. I had already attempted this quite some time ago and there were two other versions before this one.

The first version would clear the current cell ([-]), set it to the desired value using a series of +, and output the character with a ".". It worked, but it produced a huge amount of code which was quite impractical to use (again... I doubt ANY brainfuck code can be considered practical... but still).

The second version attempted to reduce the amount of code by changing the value already in the cell, instead of clearing it and setting it all over again. So if the cell held the value 'c' (99) and the next character was an 'a' (97), the program would output "--." to change the 'c' into an 'a'. Likewise, if the next character was larger than the previous one, a series of "+" was used instead. This worked well for characters that were close together, but still emitted a huge chunk of "+" or "-" when the characters were far apart (such as when outputting a space after a letter, or a lowercase character after an uppercase one).

I wrote the third version last night, and its aim was to reduce the amount of code even further than the second version could. The basic idea is the same as in the second version: modify the value of the cell from the previous value to the next, increasing or decreasing it by a certain amount. However, whenever possible, the program now generates a loop to do this. When the difference between 2 characters is small, the output is the same as in version 2; when making big jumps, a loop is generated to shrink the amount of code needed to reach the value.

This is done by finding the difference between the values (the amount we want to add to or subtract from the cell), finding a pair of factors of that difference, and then performing the multiplication with a loop: the larger factor is repeatedly added to (or subtracted from) the cell, with the loop iterating "the smaller factor" times. (Looping for the smaller factor's number of times means the loop iterates as few times as possible, slightly speeding up the generated code.) This means an extra memory cell had to be used, and the generated code looked something like this:

example of adding 30 to the current cell using factors 3 and 10: >+++[<++++++++++>-]<

This works well for some numbers, but not for all of them. A good example is prime numbers. Suppose we wanted to add 31 instead of 30. 31 is prime, so its only factors are 31 and 1, and there would still be a sequence of 31 '+'s somewhere in the output. However, there is still a way to make it shorter: overshoot or undershoot the target value by a little, and correct the difference after the loop. This way the code for moving 31 steps can be generated as ">+++[<++++++++++>-]<+", using the factors 3 and 10 to jump by 30 and then adding a final + to reach 31. The result is still shorter than just 31 "+" in a row.

The program implements this by calculating a bunch of factor pairs, each together with the value needed to correct its overshoot or undershoot, for every total jump from 2 up to twice the original difference. It then computes the length of the code that each candidate would generate, and any candidate shorter than the difference (ie. shorter than a series of X "+" in a row) qualifies. In the end the candidate with the shortest code is used and the others are discarded. If no shorter candidate is found, the program behaves like the second version and just outputs a sequence of "+" or "-" to reach the value.
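As a sanity check on the encoding, a snippet like the one above can be fed through a tiny brainfuck interpreter to confirm it really leaves 31 in the cell. This little standalone sketch (not part of the converter itself) does just that:

```java
public class BfCheck {
    // minimal brainfuck interpreter: supports > < + - [ ] and returns
    // the value of the cell under the pointer when the program ends
    static int run(String code) {
        int[] tape = new int[30];
        int ptr = 0;
        for (int i = 0; i < code.length(); i++) {
            char c = code.charAt(i);
            if (c == '>') ptr++;
            else if (c == '<') ptr--;
            else if (c == '+') tape[ptr]++;
            else if (c == '-') tape[ptr]--;
            else if (c == '[' && tape[ptr] == 0) {        // jump past matching ]
                for (int depth = 1; depth > 0; ) {
                    i++;
                    if (code.charAt(i) == '[') depth++;
                    if (code.charAt(i) == ']') depth--;
                }
            } else if (c == ']' && tape[ptr] != 0) {      // jump back to matching [
                for (int depth = 1; depth > 0; ) {
                    i--;
                    if (code.charAt(i) == ']') depth++;
                    if (code.charAt(i) == '[') depth--;
                }
            }
        }
        return tape[ptr];
    }

    public static void main(String[] args) {
        String snippet = ">+++[<++++++++++>-]<+";   // 3 * 10 + 1
        System.out.println(snippet.length());       // 21 characters instead of 31
        System.out.println(run(snippet));           // the cell ends up at 31
    }
}
```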

Here is the final code. It was implemented in Processing (just a quick sketch):

void setup() {
  String t = "Text to brainfuck converter... by VrIgHtEr";
  int prev = 0;

  for (int i = 0; i < t.length(); i++) {
    char c = t.charAt(i);
    generate(prev, c);   // emit the +/- (and loops) to move the cell from prev to c
    print('.');          // output the character
    prev = c;
  }
}

void generate(int prev, int c) {
  int difference = c - prev;
  if (difference == 0) return;

  boolean bigger = c > prev;
  ArrayList factorPairs = new ArrayList();
  int absDifference = abs(difference);
  // consider every total jump from 2 up to twice the difference,
  // remembering the leftover correction (rest) for each one
  for (int i = 2; i < absDifference*2; i++) {
    int rest = absDifference - i;
    for (int j = 2; j <= i/2; j++)
      for (int k = j; k <= i/2; k++)
        if ((j*k) == i) factorPairs.add(new factorPair(j, k, rest));
  }

  // find the candidate that generates the shortest code
  int shortest = absDifference;
  factorPair fShortest = null;
  for (int i = 0; i < factorPairs.size(); i++) {
    factorPair f = (factorPair)factorPairs.get(i);
    int l = f.getLength();
    if (l < shortest) {
      shortest = l;
      fShortest = f;
    }
  }

  if (shortest == absDifference) outputNoLoop(absDifference, bigger);
  else fShortest.output(bigger);
}

void outputNoLoop(int difference, boolean bigger) {
  for (int i = 0; i < difference; i++) print(bigger ? '+' : '-');
}

class factorPair {
  int f1, f2, rest;

  factorPair(int f1, int f2, int rest) {
    this.f1 = f1;
    this.f2 = f2;
    this.rest = rest;
  }

  int getLength() {
    // '>' + f1 plusses + '[<' + f2 symbols + '>-]<' + |rest| correction symbols
    return 7 + f1 + f2 + abs(rest);
  }

  void output(boolean bigger) {
    char r = bigger ? '+' : '-';
    print('>');
    for (int i = 0; i < f1; i++) print('+');
    print("[<");
    for (int i = 0; i < f2; i++) print(r);
    print(">-]<");
    // correct the undershoot (rest > 0) or overshoot (rest < 0)
    char correction = rest > 0 ? r : (bigger ? '-' : '+');
    for (int i = 0, n = abs(rest); i < n; i++) print(correction);
  }
}