How to speed up a PostgreSQL database
It all started when a certain report crept up to about 4.5 minutes to compile. The complaints made it clear something had to be done to speed up the database. We started by increasing the server's memory from 256 MB to 1 GB. This shaved about 2 minutes off, leaving us at 2.5 minutes. The 'top' command showed that no virtual memory (swap) was being used, which led us to conclude that adding even more memory wouldn't get us where we wanted to be.
The next step was tuning the PostgreSQL server itself. The postgresql.conf file exposes various memory-allocation settings you can tweak. After trying all the advice available on the internet, one change showed the most improvement: shared_buffers. Its default at the time was 1000. We bumped it to 3250, which was about as high as we could go before PostgreSQL would no longer start. That got us down to about 2 minutes flat, but this still wasn't good enough. We needed it down to a couple of seconds.
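For reference, the change looks roughly like this in postgresql.conf. Note that on the PostgreSQL versions of this era, shared_buffers is a count of 8 KB buffers rather than a byte size, so 3250 works out to roughly 25 MB of shared memory; the right value depends on your hardware, and the server must be restarted for it to take effect.

```
# postgresql.conf -- the one setting that made a measurable difference.
# shared_buffers is measured in 8 KB buffers on this PostgreSQL version,
# so 3250 buffers is roughly 25 MB of shared memory.
shared_buffers = 3250      # default was 1000
```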
Someone asked, "What about a stored procedure?" So we took a look at what kind of performance increase a stored procedure might provide. The results were minimal. I think stored procedures are more for cleaning up syntax than anything else. Is it possible that a well-written stored procedure might have changed the performance? Yes, it's possible, but my gut was telling me to look in another direction.
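To give an idea of what such an attempt looks like, here is a minimal sketch of wrapping a query in a function. The function name and signature are illustrative, not our actual schema:

```sql
-- Illustrative sketch: wrapping the report query in a SQL function.
-- The function name and parameters are made up for this example.
CREATE OR REPLACE FUNCTION employees_entered(start_date date, end_date date)
RETURNS SETOF employees AS $$
    SELECT * FROM employees
    WHERE enterdate >= $1 AND enterdate <= $2;
$$ LANGUAGE sql;

-- Callers then use the function in place of the raw query:
SELECT id, firstname, lastname FROM employees_entered('2005-03-01', '2005-03-31');
```

Since the server still plans and executes essentially the same query inside the function, it isn't surprising that the gain was minimal.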
We decided to take a closer look at the queries themselves. The EXPLAIN ANALYZE command let us see the execution time of a given query. Here is how it works. Let's assume you have the following query.
SELECT id, firstname, lastname, enterdate FROM employees WHERE enterdate >= '2005-03-01' AND enterdate <= '2005-03-31';
You can use the EXPLAIN command alone to have the database estimate how expensive the query will be. Please note this doesn't actually run the query; it only reports the planner's estimates.
EXPLAIN SELECT id, firstname, lastname, enterdate FROM employees WHERE enterdate >= '2005-03-01' AND enterdate <= '2005-03-31';
Using the EXPLAIN ANALYZE command actually runs the query and gives you the actual run time values.
EXPLAIN ANALYZE SELECT id, firstname, lastname, enterdate FROM employees WHERE enterdate >= '2005-03-01' AND enterdate <= '2005-03-31';
Both commands will give you a report similar to the one below (EXPLAIN alone omits the "actual time" figures).
Hash IN Join  (cost=25.02..50.09 rows=7 width=33) (actual time=265.523..455.897 rows=13864 loops=1)
  Hash Cond: ("outer".id = "inner".id)
  ->  Seq Scan on genericdata  (cost=0.00..20.00 rows=1000 width=33) (actual time=0.011..181.935 rows=139900 loops=1)
  ->  Hash  (cost=25.00..25.00 rows=6 width=4) (actual time=12.290..12.290 rows=0 loops=1)
        ->  Seq Scan on generic  (cost=0.00..25.00 rows=6 width=4) (actual time=6.684..10.909 rows=1147 loops=1)
              Filter: ((enterdate >= '2005-03-01'::date) AND (enterdate <= '2005-03-31'::date))
Total runtime: 528.500 ms
The part of the EXPLAIN output I was most concerned with was "Total runtime: 528.500 ms". This told us it takes about half a second for just one query to complete. The webpage we were trying to speed up sends almost 500 queries to the database, due to the enormous amount of data it computes and generates. At half a second each, that alone accounts for roughly 250 seconds, about four minutes, which lines up with what we were seeing.
The question now was how to speed it up. The table itself has over 100,000 records to sort through. At first I tried rewriting the query's "INNER JOIN" as "WHERE id IN (SELECT id FROM blahblah WHERE enterdate >= '2005-03-01' AND enterdate <= '2005-03-31');". This only shaved about 10 milliseconds off the time. There was also a key referencing technique that showed some improvement, but the actual syntax slips my mind. I tried going through my history to find it, but after 15 minutes of searching I decided it wasn't important enough at this point, since a better solution was waiting.
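For illustration, the two forms of the query looked roughly like this, using the table names from the plan output above (our real query selected more columns):

```sql
-- Original form: INNER JOIN between the data table and the dated table.
SELECT d.*
FROM genericdata d
INNER JOIN generic g ON g.id = d.id
WHERE g.enterdate >= '2005-03-01' AND g.enterdate <= '2005-03-31';

-- Rewritten form: the same restriction expressed as an IN subquery.
-- On our data this was only about 10 ms faster.
SELECT d.*
FROM genericdata d
WHERE d.id IN (SELECT id FROM generic
               WHERE enterdate >= '2005-03-01' AND enterdate <= '2005-03-31');
```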
Finally, I remembered something about temporary tables from years ago. I had heard of them, but had never really used them. Would they really help, though? I'd had conversations with other programmers about splitting databases apart; some held firmly to the position that splitting a database apart just to reduce its size wouldn't do much. Here was a perfect chance to put that theory to the test. On my test database I deleted all of last year's records, which reduced the record count to 10,000. When we generated the report, the compile time dropped to about 4 seconds. Wow, that's the performance change we'd been looking for. Here's where TEMPORARY tables fit in. You can select anything you want into a TEMPORARY table, and at the end of the database session the TEMPORARY table is removed automatically. At the start of the session we create a TEMP table using the following syntax:
SELECT id, firstname, lastname, enterdate INTO TEMP temp_employee FROM employees WHERE enterdate >= '2005-03-01' AND enterdate <= '2005-03-31';
Now we can use the temporary table "temp_employee" for all our calculations. This was by far the fastest way I found to speed up the webpage. By all means, adding memory was good, but it didn't give us the performance boost we needed. Overall, PostgreSQL has been a good database over the past couple of years; this is the first time I've really had any problems with it. I recommend it to anyone looking for a good database solution, especially since it's open source and free. I've also read somewhere that PostgreSQL is used to manage the .org domains.
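To sketch the idea, the report's remaining queries then read from the small temporary table instead of the full base table. The aggregate below is just an illustrative example, not one of our actual report queries:

```sql
-- PostgreSQL also accepts the equivalent CREATE TEMP TABLE ... AS SELECT ... form.
-- Every later query in the session scans the small date-filtered temp table
-- rather than the 100,000+ row base table.
SELECT enterdate, count(*) AS hires
FROM temp_employee
GROUP BY enterdate
ORDER BY enterdate;
```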