Tuesday, September 05, 2006

Postgresql Mod - Backup parts of a table

Well I got a chance to get into the postgres code although for a minor code addition. It turned out to be pretty cool just trying to understanding a small part of the source code.

Requirement.
1] Dump out only a part of a database table. eg oldest entries first

Approach
1] Now there are multiple ways to do it. For eg. you can go ahead and make a duplicate temp table with the rows that you want to dump out and then use the pg_dump tool to get you a copy as a binary file. Now that turns out to be a good idea if the amount of data to be moved around is relatively small as compared to the database, since the duplication will require a proportionate amount of space itself right.
2] The other way is to go ahead and modify the pg_dump tool itself to provide you with a partial dump.
Ofcourse keeping in with tradition we take the tougher approach ie the second one.


Once you get into the pg_dump source you realise that all it does is a dumb COPY command which does the real man's job of accessing the data for a given table/database and putting it out a given file. Onto the COPY source code in $PGSOURCE/src/backend/commands/copy.c

So what the COPY command does internally is to get to the database/table requested and redirect the binary file onto the specified file. Now if you want to dump out part of a table then you need some sort of query to be able to partition the part of the table that you desire. However doing this turns out to be very inefficienct since that would require you to have accesses into the table to get each row and then do comparisons on it and selectively dump to file. Not a good thing at all considering the large number of entries that we have.

Now since our requirement specifies that we need a part of the table to be dumped out, more specifically the older data we can work on the table if it has some timelined indices. So searching high and low in the docs I came across the funda of OIDs. Now OIDs or Object Identifiers are 4byte (on a 32bit m/c) integers that are unique per tuple on every table within the database. Moreover OIDs are accessed for dumps, it is a parameter to pg_dump, lucky us. Moreover since OIDs are appended to the tuples by the database itself on insertion we can rely on an effective technique being used in there. Something better than having triggers on each insert I hope!

So I go ahead and add two more parameters to the COPY command
1. from_oid
2. upto_oid
which gives us the range of OID's that we are interested in. We should be able to get these two values by checking the min and max values of the OIDs for a given table and simply do a percentage addition on the min value to get the upto_oid. You also need to make sure the parser recognises they new options so head off to $PGSOURCE/src/backend/parser/gram.y and good luck.

so now my customised postgres works as such
pg_dump from_oid 1634 upto_oid 1734
(internally)
COPY -t tablename TO filename WITH OIDS FROM_OID 1634 UPTO_OID 1734;


And if you really want to do some meaningful stuff. Take a look at this dude's webpage. Thanks Neil the tips were really helpful.

Technorati Tags:

No comments: