From: Mantis B. T. <no...@bu...> - 2016-07-25 10:13:08
The following issue has been CONFIRMED.
======================================================================
http://bugs.bacula.org/view.php?id=2238
======================================================================
Reported By:        omahn
Assigned To:
======================================================================
Project:            Bacula Bug Reports
Issue ID:           2238
Category:           Director
Reproducibility:    always
Severity:           major
Priority:           low
Status:             confirmed
======================================================================
Date Submitted:     2016-07-25 09:23 BST
Last Modified:      2016-07-25 11:13 BST
======================================================================
Summary:            Overflow of integer column in basefiles
Description:
When using basefiles on large systems with billions of files, it is possible to overflow the baseid column: it uses an 'integer' type, whereas (on PostgreSQL at least) the sequence that populates this column can generate 'bigint' values:

bacula=# select sequence_name, last_value, max_value from basefiles_baseid_seq ;
    sequence_name     | last_value |      max_value
----------------------+------------+---------------------
 basefiles_baseid_seq | 2147483662 | 9223372036854775807
(1 row)

This causes the integer in basefiles to overflow, as seen in this error once the basefiles table holds more rows than the maximum integer value:

20-Jul 10:55 bacula-dir JobId 109374: Fatal error: Query failed: INSERT INTO BaseFiles (BaseJobId, JobId, FileId, FileIndex) SELECT B.JobId AS BaseJobId, 109374 AS JobId, B.FileId, B.FileIndex FROM basefile109374 AS A, new_basefile109374 AS B WHERE A.Path = B.Path AND A.Name = B.Name ORDER BY B.FileId: ERR=ERROR: integer out of range

Changing baseid in the basefiles table from an integer to a bigint resolves the issue on a live system. In the Bacula source, changing baseid from serial to bigserial in src/cats/make_postgresql_tables.in should prevent the issue from occurring.

Steps to Reproduce:
1.
Configure a bacula environment using basefiles.
2. Run sufficient backups to record just over 2.1 billion (max int) base file references.

Additional Information:
Tested on 5.2.6 from Debian Jessie, but the column types appear the same in the latest Bacula releases.
======================================================================

----------------------------------------------------------------------
 (0007375) kern (administrator) - 2016-07-25 10:42
 http://bugs.bacula.org/view.php?id=2238#c7375
----------------------------------------------------------------------
Something is wrong in your analysis, but I am not sure what. BaseId is never used in Bacula; it is just a way to sort the base Job records. There is one BaseJob record per Base backup. This means that for the BaseId to overflow, you would have had to run more than 2 billion backup jobs. That seems practically impossible.

The only SQL variable (column) that can grow beyond 2 billion when a lot of files are backed up is the FileId column, and it is always defined as bigserial or bigint.

I see the error, but the only way to really understand what is going on is to run your Bacula under the debugger and trap it at the point the fatal error message is printed, then examine each of the values to see what they are. It may also be possible to enable some PostgreSQL debug code so that we can know what value it is complaining about.

By the way, it would not make any sense to increase the size of baseid unless we also increase the size of jobid. JobId can overflow, but only if you run more than 2 billion Bacula jobs (of all types), and in that case, with normal pruning, I would hope that PostgreSQL would start re-using ids.
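[Editorial note: kern's suggestion to examine the actual values can be approximated without a debugger. A minimal diagnostic sketch, assuming PostgreSQL and the sequence name shown in the report:

```sql
-- Compare the sequence's current value with the signed 32-bit
-- integer maximum (2147483647); the report's last_value of
-- 2147483662 is already 15 past it, so new baseid values can no
-- longer fit in an 'integer' column.
SELECT last_value,
       2147483647 AS int_max,
       last_value > 2147483647 AS overflowed
  FROM basefiles_baseid_seq;
```

If `overflowed` comes back true, the failing INSERT is explained without tracing the Director itself.]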
----------------------------------------------------------------------
 (0007376) omahn (reporter) - 2016-07-25 10:51
 http://bugs.bacula.org/view.php?id=2238#c7376
----------------------------------------------------------------------
Maybe I missed something in my original explanation. My understanding is that the basefiles table contains an entry for every file in a backup job that wasn't actually backed up during a full (or I/D) because it is already contained within a base backup. For example, here are 5 records from our basefiles table:

bacula=# SELECT * FROM basefiles ORDER BY baseid DESC LIMIT 5;
   baseid   | jobid  |   fileid    | fileindex | basejobid
------------+--------+-------------+-----------+-----------
 2235657177 | 109509 | 11123499778 |   7651558 |    108823
 2235657176 | 109509 | 11123499777 |   7651555 |    108823
 2235657175 | 109509 | 11123499776 |   7651548 |    108823
 2235657174 | 109509 | 11123499775 |   7651545 |    108823
 2235657173 | 109509 | 11123499774 |   7651486 |    108823
(5 rows)

Does that make more sense now? As you can see, we have already overrun a normal int in baseid in basefiles.

----------------------------------------------------------------------
 (0007377) kern (administrator) - 2016-07-25 11:12
 http://bugs.bacula.org/view.php?id=2238#c7377
----------------------------------------------------------------------
I think it was I who missed something, not you. My only excuse is that I didn't write the code :-(. It does look like there is one baseid record per file -- thus clearly an overflow at some relatively early point. The person who wrote the code is on vacation right now, and I will want him to review any changes, but I think baseid has to be changed from serial to bigserial. I will carefully check, but I hope there are no other similar cases in other tables.
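[Editorial note: the fix the two participants converge on can be sketched as SQL. The ALTER statement mirrors the live-system change the reporter describes; the audit query is a hypothetical way to check kern's concern about similar cases in other tables (nothing beyond basefiles/baseid is named in this report). A sketch, assuming PostgreSQL:

```sql
-- Fix for an existing catalog: widen the column. The type change
-- rewrites the (very large) basefiles table, so run it in a
-- maintenance window.
BEGIN;
ALTER TABLE basefiles ALTER COLUMN baseid TYPE BIGINT;
COMMIT;

-- Hypothetical audit for other at-risk columns: list every 32-bit
-- integer column whose default draws from a sequence, since each
-- could overflow the same way baseid did.
SELECT table_name, column_name, column_default
  FROM information_schema.columns
 WHERE table_schema = 'public'
   AND data_type = 'integer'
   AND column_default LIKE 'nextval%'
 ORDER BY table_name, column_name;
```

For new installs, the corresponding source change is baseid serial -> bigserial in src/cats/make_postgresql_tables.in, as the reporter suggests.]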
Issue History

Date Modified     Username   Field                Change
======================================================================
2016-07-25 09:23  omahn      New Issue
2016-07-25 10:42  kern       Note Added: 0007375
2016-07-25 10:42  kern       Status               new => feedback
2016-07-25 10:51  omahn      Note Added: 0007376
2016-07-25 10:51  omahn      Status               feedback => new
2016-07-25 11:12  kern       Note Added: 0007377
2016-07-25 11:13  kern       Status               new => confirmed
======================================================================