Post

Problem 3 - Generating Prime Numbers

Generating prime factors is a task that comes often in the first 100 problems of Euler Project. One of the nice things of coding Euler project on top of postgres is that one can easily persist useful results to reuse in other problems. When I coded it in python, I had an efficient prime generating function (actually more than one) that I reused in several Euler problems. Well, when coding in postgres we can reuse the code, but the system suggests you to also reuse the data. It is a subtle difference but I appreciate it a lot.

Problem 3

This problem asks for the largest prime factor of 600851475143. Well, that’s a huge number. We don’t want to loop until there. Doing a minimum of math, we realize that the upper limit for the largest prime factor of N is sqrt(N). In this case, the number is below one million. So if we generate a table with all the primes below one million, we can solve this problem, and many others in the sequence.

We are going to use a variation of Sieve of Erastothenes method to compute this table. The central idea is to compute all numbers that are not prime (composite numbers) because they are easier to compute than the primes themselves. This variation is less efficient but it is easier to express in SQL. Here it is:

1
2
3
4
5
6
7
8
\set max_sieve 2000000
create table primes as (
  select          generate_series(2, :max_sieve) as n
    except
  select distinct generate_series(2*t1.n, :max_sieve, t1.n) 
             from generate_series(2, :max_sieve) as t1(n) 
         order by 1
)

The core of the algorithm lies in the second term of the except clause, starting by select distinct. What we have here is a double loop: the external loop is the generate_series call as a table in the from clause, and the inner loop is the generate_series call as a column in the select distinct list. The inner loop generates multiples of t1.n greater than itself. With this method, we generate all composite number up to a given threshold (:max_sieve in this case). Then, the rest of the except clause is trivial.

This took around 10 seconds on my computer, which is acceptable since we can persist the results. With respect to the original algorithm, we did a modification on the external loop. Using generate_series we can only express a fixed range, so we are generating much more items than needed. In the original Sieve’s, while we add numbers in the composite set, we remove these same numbers from the external loop. Let’s see the explain section of the core of our algorithm:

1
2
3
4
5
6
7
...
->  Subquery Scan on "*SELECT* 2"  (cost=58774941.24..66592433.43 rows=200000 width=8) (actual time=5513.864..10058.802 rows=921501 loops=1)
   ->  HashAggregate  (cost=58774941.24..66590433.43 rows=200000 width=4) (actual time=5513.860..9923.877 rows=921501 loops=1)
       ...
     ->  ProjectSet  (cost=0.00..5024994.99 rows=999999000 width=4) (actual time=0.040..1550.715 rows=11970035 loops=1)
	   ->  ProjectSet  (cost=0.00..5000.01 rows=999999 width=4) (actual time=0.031..105.927 rows=999999 loops=1)
           ...

From bottom up, we see that the outer loop generates 1 million rows, whereas in the original Sieve’s we would generate only 78498 (the total number of primes). Then, the inner loop generates almost 12 million rows, where the original woud generate 921501 (the total number of composites). Finally, we see that it takes almost 5 seconds to collapse the 12 million rows into 921501 distinct composite numbers.

We have a considerable performance penalty for solving this problem with an elegant query. Can we do it better with pure SQL? Maybe another day. Let’s conclude with the solution for problem 3.

1
2
3
4
5
6
  \set N 600851475143
  select number 
    from primes 
   where mod(:N,number) = 0 -- number is factor of :N
   order by number desc 
   limit 1
This post is licensed under CC BY 4.0 by the author.