Menu

Potential refactor of linked list data structure

agaeki
2022-05-11
2022-06-15
  • agaeki

    agaeki - 2022-05-11

    I've had a rework of CppCheck's underlying data structure in mind for a while now, since doing some work on CppCheck for my job, and I've recently found free time to start on it. I thought I'd share the details here and get your thoughts and feedback, rather than spend months on it and then learn it's a bad idea. The refactor/rewrite has two goals: To move the data structure logic out of the Token class, thereby making it easier to read and maintain, and to improve performance by ensuring that adjacent tokens are kept adjacent in memory to reduce cache misses.

    My plan is to implement this as an Unrolled Linked List, essentially a list of chunks where each chunk contains an array of Tokens thereby keeping the Tokens next to each other in memory. I will be implementing this in a separate ListChunk class that will handle adding, removing, and moving Tokens - several operations will involve resizing arrays but each resize will involve a small subset of Tokens so won't be expensive. To start with I will implement the interface to be called from the Token class so changes to the existing code and access patterns will be minimal, but once the new data structure is in place and stable it may be that some data access patterns can be changed to be cleaner or more optimal.

    I've made a good start on the change but it's a fairly major one so I expect going to be slow, I'll be pushing changes to my repo here https://github.com/agaeki/cppcheck and I'll make pull requests once I have something worth pulling.

    What do you think of my idea? Can you think of any potential obstacles that I might encounter?

     
  • john borland

    john borland - 2022-05-14

    One thing I would look into first is pfl::list https://plflib.org/list.htm it is a drop in replacement for std::list. It makes a list more like a vector. I'm not a cppcheck develper, just some who drops in on the forums from time to time.

    Given the small amount of benchmarking I have done a while back I don't think it will have very much of an effect on performance. Even with large token list the act of traversing the token list which it does many times in ValueFlow::setValues function has little effect on the overall performance of the function or not as much as one would think it would.

     
  • agaeki

    agaeki - 2022-06-10

    Unfortunately cppcheck doesn't use std::list so a direct replacement won't work. I've been working on this off-and-on for a while. I've got a sound data structure implementation that I think will do the job, but I've hit the snag that most of CppCheck's code relies on the fact that Tokens don't move in memory so my naive approach isn't good enough.

    I've got a couple of ideas that might make it work without a large-scale rewrite, I'll keep on bashing for a bit and see how I go, but unsurprisingly it's not as simple as I first hoped ;)

     
  • Daniel Marjamäki

    I am not sure what I think about this idea. If there is a significant speedup it's probably good.

    The idea of the TokenList was that it would be a kind of container for the linked token list.

    A task that I feel takes a lot of cpu is the Tokenizer. I imagine that it should somehow be possible to make Tokenizer a lot faster. Not sure if that would be interesting to consider.

     
  • john borland

    john borland - 2022-06-15

    Sorry I should of looked at the code more closely before making my suggestion. I still think the pfl list is a good thing to learn from. I find when people try to make a fast linked list they are basically trying to make a linked list more like a vector. One things that vectors can do that linked list can't is preallocate memory through its reserve function Given the wide ranges in file sizes that cppcheck processes it would be hard to guess a very good number, but maybe that's something that cppcheck could cache. There is the cppcheck flag --cppcheck-build-dir that already does some sort of cacheing for files that are unchanged. This might help in the case where a file has changed but could at least how big to make the starting size of the token list.

    It's been a while since I have looked at performance data for cppcheck so I can't make much of a comment other than the last time I looked at it the bulk of stack traces came from ValueFlow::setValues() but if I was going to follow anyone's cppcheck hunches I would definitely look into Daniel's.

    I have told my self I need to find some time to see if I can make it easier for cppcheck to make flame graphs. That's a tool I use at work to catch a lot of performance issues. But that is a story for a different thread.

     
  • john borland

    john borland - 2022-06-15

    I ran a quick flame graph last night on the head commit from last night making only the compiler flag changes and not my normal source file changes to devirtualize the value flow condition. I have attached the svg file which I normally uses firefox to look at. It still shows that a bulk of the time is spent in ValueFlow::setValues() stack traces. With value flow condition and value flow sub function being the biggest culprits in my test case.

    mkdir Build && cd Build
    export CXXFLAGS="-g -fno-omit-frame-pointer"
    cmake MATCHCOMPILER=YES -DCMAKE_BUILD_TYPE=Release ..
    make -j 6 >& build.out.out
    perf record -F 99 --output=head.out.perf.data -g ./bin/cppcheck check=all ../lib/astutils.cpp &> head.out.out

    for commit 176eefcbf31bcb4d1e47d1fc9d178e976c8c2fc3

     

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.