From: Lloyd B. <llo...@by...> - 2025-05-21 00:56:47
Hi all,
I'm running into an issue with some bacula-fd instances and hoping
someone can point me in the right direction.
In short: I have bacula-fd instances that are clearly running jobs
(confirmed via strace), but they often time out when I run
`status client=CLIENTNAME`. They only seem reliably responsive when idle.
Details:
* Bacula version: 9.6.6 (yes, I know it's old — upgrade is planned).
* Setup: Two hosts (`zhomebackup[1-2]`) running both SD and FD. A
script at the beginning of each job snapshots NFS shares, mounts
them, and outputs file paths for backups.
* Problem: These hosts struggle to handle more than 6–7 jobs
effectively. Going beyond that causes a drop in aggregate file scan
rates.
* Attempted solution: Spun up additional FD instances on separate
ports (originally inside Docker, but now just running natively on
non-standard ports). These new instances are /intermittently/
responsive to `status client`, even with only 1–3 jobs. The original
FD (on the default port) remains responsive, even with 6–7 jobs.
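For readers without the tarball, an extra FD instance of the kind described above might be configured roughly as follows. This is only a minimal sketch, not the actual configuration from the attachment: the resource names, port 9112, directory paths, concurrency limit, catalog name, and passwords are all placeholders chosen for illustration.

```
# Hypothetical bacula-fd config for a second FD instance on the same host
FileDaemon {
  Name = zhomebackup1-fd2                  # placeholder name
  FDport = 9112                            # non-standard port (default is 9102)
  WorkingDirectory = /var/lib/bacula/fd2   # placeholder path, separate per instance
  Pid Directory = /run/bacula              # placeholder path
  Maximum Concurrent Jobs = 10             # placeholder value
}

Director {
  Name = example-dir                       # placeholder; must match the Director's name
  Password = "placeholder-password"
}

# Matching hypothetical Client resource in the Director config
Client {
  Name = zhomebackup1-fd2
  Address = zhomebackup1
  FDPort = 9112                            # must match FDport above
  Catalog = MyCatalog                      # placeholder catalog name
  Password = "placeholder-password"
}
```

With a layout like this, `status client=zhomebackup1-fd2` in bconsole would address the second instance rather than the FD on the default port.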
I'm wondering if this could be a shared resource issue or some FD
limitation I'm not accounting for. Or is there a better way to scale job
throughput?
I've attached a tarball containing systemd service files, FD configs,
and relevant parts of the Director config, including an example job
definition.
Any insights would be greatly appreciated.
Thanks,
Lloyd
--
Lloyd Brown
HPC Systems Administrator
Office of Research Computing
Brigham Young University
http://rc.byu.edu