Thread: [pocl-devel] intermittent clang ComputeLineNumbers SegFault

From: Noah R. <noa...@gm...> - 2018-12-21 22:10:28

Hi,

I figured it was about time I gave pocl a try with my physics simulation
code. I've been using Intel's OpenCL library for computing on Cray systems
with Xeon CPUs.

Today I built pocl (today's git master) on a Cray XC40 using
clang+llvm-7.0.0-x86_64-linux-sles12.3. I was able to run a simple Hello
World kernel as well as clinfo. When running my physics application at the
necessary scale, I'm seeing about 0.2% of clBuildProgram calls fail with a
SEGFAULT, all with a common stack signature (pasted below).

I'm not sure why this would be so intermittent. I've tried reducing to one
process per compute node, so only one clBuildProgram would be executing on
that node at a time. In this testing, that leaves 90 processes doing the
same program compile simultaneously in the same working directory. Is pocl
or clang trying to write anything to the working directory? In my
restricted case, /tmp is private to each compute node and thus each process.

Googling for similar stack traces, I find one mention that may well be the
same bug:
https://www.mail-archive.com/llv...@li.../msg28677.html
https://bugs.llvm.org/show_bug.cgi?id=39833

"poclcc" is successful with the same OpenCL kernel source. I assume I'd
need to run it hundreds of times, perhaps in parallel, to potentially
trigger the same bug.

Any advice would be appreciated. Now that I've thought through the
situation, I think I should probably create an account and contribute to
the LLVM bug 39833 discussion with a me-too.

Cheers,

Noah Reddell


  WmResidentPatchProcessor::WmResidentPatchProcessor(WmComputeProgram*, boost::shared_ptr<WmComputeAssignment const>, std::vector<boost::shared_ptr<WmSubDomain const>, std::allocator<boost::shared_ptr<WmSubDomain const> > > const&, WmComputeMachine&)@wmresidentpatchprocessor.cc:358
  POclBuildProgram@clBuildProgram.c:37
  compile_and_link_program@pocl_build.c:624
  pocl_llvm_build_program@pocl_llvm_build.cc:489
  clang::CompilerInstance::ExecuteAction(clang::FrontendAction&)@0x2aaaabebfd07
  clang::FrontendAction::Execute()@0x2aaaabf1c106
  clang::PrintPreprocessedAction::ExecuteAction()@0x2aaaabf22328
  clang::DoPrintPreprocessedInput(clang::Preprocessor&, llvm::raw_ostream*, clang::PreprocessorOutputOptions const&)@0x2aaaabf51226
  clang::Preprocessor::EnterMainSourceFile()@0x2aaaacc1cabc
  clang::Preprocessor::EnterSourceFile(clang::FileID, clang::DirectoryLookup const*, clang::SourceLocation)@0x2aaaacbf7407
  (anonymous namespace)::PrintPPOutputPPCallbacks::FileChanged(clang::SourceLocation, clang::PPCallbacks::FileChangeReason, clang::SrcMgr::CharacteristicKind, clang::FileID)@0x2aaaabf5212d
  clang::SourceManager::getPresumedLoc(clang::SourceLocation, bool) const@0x2aaaacc4e00e
  clang::SourceManager::getLineNumber(clang::FileID, unsigned int, bool*) const@0x2aaaacc4e43a
  ComputeLineNumbers(clang::DiagnosticsEngine&, clang::SrcMgr::ContentCache*, llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul>&, clang::SourceManager const&, bool&)@0x2aaaacc4e683

Re: [pocl-devel] intermittent clang ComputeLineNumbers SegFault

From: Michal B. <mic...@tu...> - 2018-12-27 19:13:51

Hello,


> Is pocl or clang trying to write anything to the working directory?  In my restricted case, /tmp is private to each compute node and thus each process.


Not to the working directory (AFAIK; I haven't inspected the entire Clang codebase), but pocl writes to its own cache directory, which by default is $HOME/.cache/pocl/kcache. You can change it to a different directory by setting the POCL_CACHE_DIR environment variable.

IIRC there have been some issues before when people had the cache dir located on NFS shares; is that your case (is your $HOME shared)? You could try pointing POCL_CACHE_DIR to /tmp/pocl_cache and see if it makes the problem go away. It's possible that pocl or Clang makes some assumption about the filesystem which does not hold for NFS.

In the backtrace you pasted, it seems to be crashing in the preprocessing phase. Here pocl writes to a temporary file created by LLVM's sys::fs::createUniqueFile(), which in turn uses open() with the exclusive flag on a randomized path.


Regards,

-- mb
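
As an illustration of the workaround suggested in this message, a small launcher could point POCL_CACHE_DIR at a node-local directory before the application starts. This is only a sketch under stated assumptions: the helper names, the `pocl_cache-<hostname>` directory layout, and the idea of wrapping the application in a Python launcher are hypothetical. Only the POCL_CACHE_DIR variable itself comes from the thread.

```python
import os
import socket
import subprocess
import tempfile


def node_local_cache_dir(base=None):
    """Create (if needed) and return a per-node cache directory.

    Hypothetical helper: the directory name is an assumption for illustration.
    tempfile.gettempdir() honors $TMPDIR and falls back to /tmp on Linux.
    """
    base = base or tempfile.gettempdir()
    path = os.path.join(base, "pocl_cache-" + socket.gethostname())
    os.makedirs(path, mode=0o700, exist_ok=True)
    return path


def env_with_pocl_cache(cache_dir):
    """Return a copy of the current environment with POCL_CACHE_DIR set."""
    env = dict(os.environ)
    env["POCL_CACHE_DIR"] = cache_dir
    return env


def run_with_node_local_cache(cmd):
    """Run `cmd` (a list of arguments) with the cache on node-local storage."""
    env = env_with_pocl_cache(node_local_cache_dir())
    return subprocess.run(cmd, env=env, check=True)
```

The same effect can be had by exporting the variable in the job script; the point is only that the cache path should resolve to storage that is private to each compute node.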

Re: [pocl-devel] intermittent clang ComputeLineNumbers SegFault

From: Noah R. <noa...@gm...> - 2018-12-27 22:30:29

Hi Michal,
    Thank you for the suggestion of POCL_CACHE_DIR. Setting this to a
tmpfs unique to each compute node immediately worked around the issue.
I can now reliably run my application.

On most Cray systems, $HOME is a DVS mount when mounted on compute
nodes. I'm sure there are many similarities between DVS and NFS.

I would think a better default location for the pocl cache (on Linux)
would be derived from $TMPDIR rather than $HOME.

I wonder if sys::fs::createUniqueFile() is not so unique after all at
this scale? Could this lead to a sort of race between the create and
the open(exclusive)?

Cheers,

Noah

Re: [pocl-devel] intermittent clang ComputeLineNumbers SegFault

From: Pekka J. <pek...@tu...> - 2018-12-28 08:47:58

Hi Noah,

> I would think a better default location for the pocl cache (linux) would 
> be derived from $TMPDIR rather than $HOME.

It used to be under /tmp, but then someone had an issue with a
multi-node NFS-mounted system where CPUs with incompatible ISAs were
getting the same binaries, IIRC.

I'm really not sure what the best overall default would be. A
/tmp/XXX dir that is unique per node?

This might be related: https://github.com/pocl/pocl/issues/430

BR,
-- 
Pekka

Re: [pocl-devel] intermittent clang ComputeLineNumbers SegFault

From: Michal B. <mic...@tu...> - 2018-12-28 09:10:05

Hello Noah,


> I would think a better default location for the pocl cache (linux) would be derived from $TMPDIR rather than $HOME.

Having it on /tmp makes the cache non-persistent on many systems, which kind of defeats the purpose of having a cache in the first place. Perhaps there is a more suitable place, but I'm not aware of it.

> I wonder if sys::fs::createUniqueFile() is not so unique after all at this scale? Could this lead to a sort of race between the create and open(exclusive)...?

I'm 99.9% sure it's unique. I'm not sure what race you have in mind, but IIRC LLVM just appends a random string to a template filename, then tries open(O_CREAT | O_EXCL), and repeats if that fails. Pocl then closes the descriptor and hands the filename over to Clang's preprocessor. It's possible that Clang removes the file before re-opening it for writing, or that something else is going on which triggers a bug.

Regards,
-- mb
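
The create-and-retry scheme described in this message can be sketched in a few lines. This illustrates the general pattern (random suffix, open with O_CREAT | O_EXCL, retry on collision) and is not LLVM's actual code: the function name `create_unique_file`, the use of `%` as the placeholder character, and the retry limit are assumptions made for the example.

```python
import errno
import os
import secrets

HEX_DIGITS = "0123456789abcdef"


def create_unique_file(model, max_attempts=128):
    """Replace each '%' in `model` with a random hex digit and create the file.

    Returns (fd, path). If the chosen path already exists, a new random
    name is tried, up to `max_attempts` times.
    """
    for _ in range(max_attempts):
        path = "".join(
            secrets.choice(HEX_DIGITS) if c == "%" else c for c in model
        )
        try:
            # O_EXCL makes open() fail with EEXIST if the path already exists,
            # provided the filesystem implements exclusive create correctly.
            fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
            return fd, path
        except FileExistsError:
            continue
    raise OSError(errno.EEXIST, "could not create a unique file", model)
```

Note that the exclusivity guarantee only covers the open() call itself. Once the descriptor is closed and only the name is passed on, as described above, a later re-open of that name is an ordinary open and relies on no other process having picked the same name.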


Re: [pocl-devel] intermittent clang ComputeLineNumbers SegFault

From: Noah R. <noa...@gm...> - 2018-12-28 17:31:27

>
> Having it on /tmp on many systems makes the cache non-persistent, which
> kind of defeats the purpose of having a cache in the first place... perhaps
> there is a more suitable place, but i'm not aware of it.
>
There's a complex set of factors to balance, for sure. Since the default
behavior is to remove build products, I don't think the default
POCL_CACHE_DIR needs to be persistent storage. $HOME is generally going to
be slower and farther away than $TMPDIR.

Most importantly, the behavior is already customizable through the
POCL_CACHE_DIR variable, so I have a workaround. A general user wouldn't
know to adjust the variable upon encountering this SEGFAULT without
finding a record of this discussion.

The lingering problem is that we don't understand what is driving the
clang SEGFAULT, but it seems most likely related to a false success
of open(O_CREAT | O_EXCL) on this DVS filesystem (I'm speculating that it
hits the same issue as older NFS filesystems). In addition to the working
local /tmp for POCL_CACHE_DIR, I tried a Lustre parallel filesystem path
(common to all compute nodes). This works as well, presumably because this
more sophisticated filesystem correctly supports O_EXCL.


Side question: when I export POCL_VECTORIZER_REMARKS=1, where should the
output go? I'm not seeing anything in the stdout/stderr streams
or in ${POCL_CACHE_DIR}/*/*/build.log.
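
The O_EXCL hypothesis in this message could be probed directly: have many contenders try to exclusively create the same path and count how many believe they succeeded. The sketch below is an assumption-laden illustration, not something run in the thread. The function names are hypothetical, and the thread-based harness only exercises a single client, so a meaningful test of a network filesystem would need `try_exclusive_create` to be called from processes on many compute nodes at once.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def try_exclusive_create(path):
    """Return True if this caller created `path`, False if it already existed."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def count_winners(path, contenders=32):
    """Race `contenders` threads for the same path and count the successes.

    On a filesystem with correct O_EXCL semantics the result is exactly 1.
    """
    with ThreadPoolExecutor(max_workers=contenders) as pool:
        results = list(pool.map(try_exclusive_create, [path] * contenders))
    return sum(results)
```

A count greater than 1 on a given mount would support the "false success" explanation; a count of exactly 1 would not rule it out, since the failure rate reported here was only about 0.2%.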



Re: [pocl-devel] intermittent clang ComputeLineNumbers SegFault

From: Andreas K. <li...@in...> - 2018-12-29 13:53:26

Noah,

Noah Reddell <noa...@gm...> writes:
>> Having it on /tmp on many systems makes the cache non-persistent, which
>> kind of defeats the purpose of having a cache in the first place... perhaps
>> there is a more suitable place, but i'm not aware of it.
>>
> There's a complex set of factors to balance for sure.  Since the default
> behavior is to remove build products, I don't think the default
> POCL_CACHE_DIR needs to be persistent storage. $HOME is generally going to
> be slower and farther away than $TMPDIR.
>     Most importantly the behavior is already customizable through
> POCL_CACHE_DIR variable. I have a work-around.  A general user wouldn't
> know to adjust the variable upon encountering this SEGFAULT unless
> discovering record of this discussion.

~/.cache (or, really, whatever $XDG_CACHE_HOME points to) is the default
location for "user-specific non-essential data files" under the XDG Base
Directory Specification [1]. While that's a desktop-focused spec, it
establishes a convention that is independent of the desktop use case per
se. In particular, all parts of the spec are applicable (in a technical
sense) even in a command line context.

Arguably, the machine you are using should be configured to put
$XDG_CACHE_HOME someplace sensible (ideally, on a per-compute-node
FS). IMO, this would be a much preferable outcome compared to inventing
yet another convention or reverting to someplace in $TMPDIR, which is
insecure in a multi-user workstation/desktop setting.

[1] https://specifications.freedesktop.org/basedir-spec/latest/

Andreas
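
The lookup rule from the XDG Base Directory Specification is short enough to sketch. The first function follows the spec's rule for the cache base directory ($XDG_CACHE_HOME if it is set to an absolute path, otherwise $HOME/.cache). The second function, including its name and the idea that the XDG location feeds the kernel cache path, is an assumption made for illustration and not a statement of what pocl implements. The `pocl/kcache` suffix and the POCL_CACHE_DIR override are taken from earlier messages in this thread.

```python
import os


def xdg_cache_home(env):
    """Resolve the cache base directory per the XDG Base Directory Specification."""
    value = env.get("XDG_CACHE_HOME", "")
    # The spec treats unset, empty, and relative values as "use the default".
    if value and os.path.isabs(value):
        return value
    return os.path.join(env["HOME"], ".cache")


def kernel_cache_dir(env):
    """Pick a kernel cache directory: explicit override first, XDG location otherwise."""
    override = env.get("POCL_CACHE_DIR")
    if override:
        return override
    return os.path.join(xdg_cache_home(env), "pocl", "kcache")
```

Under this scheme a site administrator could set XDG_CACHE_HOME to a per-node path once, and every tool that follows the convention would pick it up without a tool-specific variable.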

Re: [pocl-devel] intermittent clang ComputeLineNumbers SegFault

From: Pekka J. <pek...@tu...> - 2018-12-30 11:55:48

Hi,

I'm afraid the vectorizer remarks feature (POCL_VECTORIZER_REMARKS) broke with the latest LLVM versions, and no one has had spare time to fix it. It should not be too difficult to fix, though, if you want to give it a try.

https://github.com/pocl/pocl/issues/613

BR,
Pekka

Pekka Jääskeläinen


Re: [pocl-devel] intermittent clang ComputeLineNumbers SegFault

From: Michal B. <mic...@tu...> - 2018-12-31 10:07:45

Hi,


> Since the default behavior is to remove build products,


Actually the default behavior is to remove the intermediate build products, but keep the final product (.so dynamic library).


> $HOME is generally going to be slower and farther away than $TMPDIR.


That's true, but unless $HOME is on the other side of the planet, compilation is likely going to be much slower than whatever filesystem the $HOME is on.


> A general user wouldn't know to adjust the variable upon encountering this SEGFAULT unless discovering record of this discussion.


You're right. I will make a note of this issue in the documentation. I'm reluctant to change the current behavior though, as $HOME on network filesystem seems to be the exception rather than the rule.

> it seems most likely related to false success of open(O_CREAT | O_EXCL)

From quick research (AKA googling), that seems to be the case indeed, at least for some versions of NFS. If someone with a networked setup comes up with a patch that replaces open(create-exclusive) with something that works on NFS and does not break on local filesystems, we'll be happy to accept it.

Regards,
-- mb
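
One long-standing candidate for such a replacement is the link()-based scheme described in the Linux open(2) manual page for NFS setups without reliable O_EXCL: create a uniquely named file on the same filesystem, hard-link it to the target name, and treat the link count as the authoritative result. The sketch below only illustrates that technique. The function name and the naming of the temporary file are hypothetical, it has not been tested on DVS or NFS, and it is not a proposed pocl patch.

```python
import os
import socket


def create_exclusive_via_link(target):
    """Try to create `target` atomically using link(2) instead of O_EXCL.

    Returns True if this caller created `target`, False if it already existed.
    """
    directory = os.path.dirname(target) or "."
    # A name unique per host and process, on the same filesystem as the target.
    unique = os.path.join(
        directory,
        ".%s.%s.%d" % (os.path.basename(target), socket.gethostname(), os.getpid()),
    )
    fd = os.open(unique, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    os.close(fd)
    try:
        try:
            os.link(unique, target)
            return True
        except OSError:
            # The reply to link() can be lost on a network filesystem, so the
            # link count of the unique file decides whether the link was made.
            return os.stat(unique).st_nlink == 2
    finally:
        os.unlink(unique)
```

The per-host, per-process temporary name means this sketch is not safe for several threads of one process racing on the same target; a real replacement would also need to cope with filesystems that do not support hard links.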
