Hi,

I was pointed to this thread. I wrote trace_inputs.py. It was used to isolate chromium tests and figure out which files are touched by each test case. It doesn't care about anything except files being opened, so strace log parsing was kept to a minimum. Note that I've mostly been using the stock ubuntu strace version, which is quite old; some of the issues below may have been fixed already. For background context, I wrote the tool in python because 1) I'm lazy and 2) it was easier to deploy a single tool quickly on linux, OSX and Windows; the goal of this tool is to hide the underlying differences between each OS as much as possible.

In practice, I stumbled upon a lot of issues while reading the logs:
- Processes being killed arbitrarily, causing all sorts of incomplete logs. The chromium unit tests tend to spawn a lot of processes quickly and kill them just for fun.
- Using a single log file resulted in a severely corrupted log file; it looks like multithreaded output is not properly interlocked in the output file.
- Even using one log file per process with -ff still causes occasional log corruption for heavily multithreaded processes. Note that a workstation with a lot of cores probably exhibits this problem more readily.
- No clear separation of call / return in the logs, causing a funny search to figure out which call completed. There's a heuristic in Strace.Content.Process._pending_calls.
- Encoding: since linux doesn't prescribe utf-8, I have to hope for the best. Note: I should use -x; I don't recall why I had commented it out. Maybe just laziness.
- Encoding again: I had to write a state machine to read the logs properly, see the >110 lines of strace_process_quoted_arguments(). That's independent of -x.
- No clear parent-child relationship between the logs when using -ff, forcing the reader code to play a game to figure out which process started which processes.
- Only after the process hierarchy is figured out can I determine the starting current working directory of each process, causing a lot of "late bound paths". See the RelativePath class. It definitely slows down reading the logs and risks corrupting paths in the trace reader if it guesses the starting cwd incorrectly.
- Generally inconsistent output, see all the RE_* in Strace.Content. No regexp is guaranteed to be able to read a log line.
- No header listing the hooked functions I should expect to appear, and the number of arguments for each. Since this changes from one OS version to another, I made the code assert on unknown functions so I can add them manually.
- A variable number of arguments to a function, coupled with split log lines and interesting quoting, caused some headaches IIRC. (Or was it OSX doing this? Sorry, I don't recall.)
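
For illustration, the "funny search" for matching split calls mostly boils down to pairing strace's `<unfinished ...>` / `<... resumed>` line halves. A toy sketch (the two regexps only cover the common shapes; the real _pending_calls heuristic handles more cases and tracks per-process state — keying by function name here assumes at most one outstanding call per function):

```python
import re

# Toy matcher for strace's split syscall lines.
RE_UNFINISHED = re.compile(r'^(\w+)\((.*) <unfinished \.\.\.>$')
RE_RESUMED = re.compile(r'^<\.\.\. (\w+) resumed>\s*(.*)$')
RE_COMPLETE = re.compile(r'^(\w+)\((.*)$')

def pair_split_calls(lines):
  """Yields (function, merged arguments + result) for each completed call."""
  pending = {}
  for line in lines:
    match = RE_UNFINISHED.match(line)
    if match:
      # First half of an interrupted call; stash it until it resumes.
      pending[match.group(1)] = match.group(2)
      continue
    match = RE_RESUMED.match(line)
    if match:
      # Glue the stashed first half to the resumed second half.
      yield match.group(1), pending.pop(match.group(1), '') + match.group(2)
      continue
    match = RE_COMPLETE.match(line)
    if match:
      yield match.group(1), match.group(2)
```

Anything left in `pending` at the end is a call that never returned, typically because the thread was killed mid-call.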
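
The state machine in strace_process_quoted_arguments() is long because of all the escape and nesting cases; the core idea fits in a few lines. A toy version that only handles double quotes and backslash escapes (the real function also deals with octal/hex escapes, truncation ellipses, etc.):

```python
def split_quoted_arguments(line):
  """Splits a comma separated strace argument list, honoring double
  quotes and backslash escapes. Toy sketch only.
  """
  args = []
  current = []
  in_quote = False
  escaped = False
  for char in line:
    if escaped:
      # Previous char was a backslash inside quotes; take this one literally.
      current.append(char)
      escaped = False
    elif char == '\\' and in_quote:
      escaped = True
    elif char == '"':
      in_quote = not in_quote
    elif char == ',' and not in_quote:
      # Argument separator, but only outside quotes.
      args.append(''.join(current).strip())
      current = []
    else:
      current.append(char)
  if current:
    args.append(''.join(current).strip())
  return args
```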
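
A minimal sketch of the "late bound path" idea, assuming a simplified RelativePath whose cwd only gets bound once the process hierarchy has been reconstructed (the real class is more involved):

```python
import os

class RelativePath(object):
  """A path recorded before the owning process's starting cwd is known,
  resolvable only once the process hierarchy has been figured out.
  Illustrative sketch, not the actual class.
  """
  def __init__(self, path):
    self.path = path
    self.cwd = None  # Unknown until the parent process is identified.

  def bind(self, cwd):
    self.cwd = cwd

  def resolve(self):
    if os.path.isabs(self.path):
      return self.path
    if self.cwd is None:
      raise ValueError('cwd still unknown for %r' % self.path)
    return os.path.normpath(os.path.join(self.cwd, self.path))
```

If the guessed cwd is wrong, every path resolved through it is silently wrong too, which is the corruption risk mentioned above.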

Of all the problems, the log corruption with heavily multithreaded processes on a multi-core system has been the most problematic. Everything else is solvable, but when the log is corrupted, there isn't much to do. Last time I tried, which is a long, long time ago, I was still able to reproduce occasional corruption with our test named "browser_tests".

A separate issue I had was that I was not able to trace native client sandboxed unit test processes unless strace was running under sudo, but to do that I had to attach to the already started process so as not to run the unit test itself as root. It's been a pain to implement. Not strictly strace's fault, just FYI.


What I'd recommend:
- json is good for smallish documents but is not great for unbounded ones. Logs can get >2gb, so IMHO I'm not sure it's a good fit, especially with regard to 64-bit ints, and json assumes unicode, which may not be correct on some systems. That said, I don't have a good recommendation either. Something custom would be fine with me. Just expose reference code to read it back in 1 language, 2-3 popular languages if possible.
- Optionally list at the start the current process info: pid, ppid, cwd, list of hooked functions and the number of args for each, etc. This would save a *lot* of processing when reading the logs of a process tree.
- Always generate 2 entries per call, one enter, one exit. Don't try to be smart and merge when possible; be consistent instead! The exit entry should be extremely easy to match with the corresponding enter, in particular in case of reentrancy. It's less messy if the thread gets killed in the middle of execution.
- python is ridiculously slow at parsing, so make sure whatever you do doesn't make reading the log *too* slow. In particular, keep support for streaming the log instead of forcing a read of the whole log at once, like a big json file.
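
To illustrate the enter/exit matching, here's a sketch assuming a hypothetical event format with an explicit per-process call id (not something strace emits today). Pairing becomes a dictionary lookup, and a thread killed mid-call just leaves an unmatched enter instead of corrupting the stream:

```python
def match_enter_exit(events):
  """Pairs enter/exit events by (pid, call id). The 'id'/'phase' fields
  are an assumed format, purely illustrative.
  """
  pending = {}
  completed = []
  for event in events:
    key = (event['pid'], event['id'])
    if event['phase'] == 'enter':
      pending[key] = event
    else:
      # Pop the matching enter; None means we never saw it (corrupt log).
      completed.append((pending.pop(key, None), event))
  # Leftovers in pending are calls that never returned.
  return completed, list(pending.values())
```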


It may look like I was dismayed by strace log parsing, but you haven't seen Windows' tracerpt.exe output. See LogmanTrace.Tracer.CsvReader.next() for a contest of which log format is the worst to read. Whoever wrote that junk shouldn't be allowed to code ever again.

I hope my comments are useful.

Thanks,

M-A