#1458 [xml] Performance degradation scanning large XML files with XPath custom rules

PMD-5.3.7
closed
xml (1)
PMD
3-Major
Bug
5.4.1
2016-06-25
2016-02-09
hmozaffari
No

I have written some custom PMD XPath rules for XML files, and some of the files being verified are large (18 MB / 400,000 lines). With PMD 4.2.6 I had no performance issues validating those files, but after upgrading to 5.4.1 memory consumption has increased and verifying a large XML file has become almost 7 times slower. I found a workaround that makes it as fast as 4.2.6: removing the line breaks from my XML file. The drawback is that PMD then doesn’t report the correct line/column numbers.

I was able to track the performance degradation down to the following class:
net.sourceforge.pmd.lang.xml.ast.DOMLineNumbers

It seems this class has been introduced in recent versions of PMD.

Discussion

  • Andreas Dangel

    Andreas Dangel - 2016-04-01
    • status: open --> in-progress
    • assigned_to: Andreas Dangel
    • Milestone: New Tickets --> PMD-5.3.7
     
  • Andreas Dangel

    Andreas Dangel - 2016-04-01

    Related issue: [#1054]. This was fixed with PMD 5.1.0. Before that, PMD had no line/column numbers for XML.

    The difficulty here is that the standard XML parsers do not retain the position of the nodes. Hence this DOMLineNumbers class basically parses the file again and recovers the position of each node.

    You made an interesting observation: the fact that it's faster without line breaks could mean that the bottleneck is not the parsing, but the mapping from absolute file position to line/column. I'll have a look...
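
    To illustrate the kind of mapping discussed above, here is a minimal sketch of one common way to turn an absolute character offset into a 1-based line/column pair cheaply: record the start offset of every line once, then binary-search that table for each lookup. This is not the code from DOMLineNumbers or from any PMD commit; the class name LineIndex and its methods are hypothetical, and it assumes lines are separated by '\n'.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PMD: maps absolute offsets to line/column.
final class LineIndex {
    // lineStarts[i] is the offset of the first character of line i + 1.
    private final int[] lineStarts;

    LineIndex(String text) {
        // Single pass over the text: O(length), done once per file.
        List<Integer> starts = new ArrayList<>();
        starts.add(0);
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                starts.add(i + 1);
            }
        }
        lineStarts = new int[starts.size()];
        for (int i = 0; i < lineStarts.length; i++) {
            lineStarts[i] = starts.get(i);
        }
    }

    // 1-based line number containing the given offset: O(log lines) per lookup.
    int lineOf(int offset) {
        int idx = Arrays.binarySearch(lineStarts, offset);
        if (idx < 0) {
            // Not an exact line start: binarySearch returns -(insertionPoint) - 1,
            // and the containing line starts at insertionPoint - 1.
            idx = -idx - 2;
        }
        return idx + 1;
    }

    // 1-based column of the given offset within its line.
    int columnOf(int offset) {
        return offset - lineStarts[lineOf(offset) - 1] + 1;
    }
}
```

    With a table like this, each node lookup costs a binary search instead of a rescan of the text from the beginning, so the total cost no longer grows with (number of nodes) x (file length). Whether that matches the actual bottleneck in DOMLineNumbers is an assumption here.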

     

  • Andreas Dangel

    Andreas Dangel - 2016-04-01
    • status: in-progress --> closed
     
  • Andreas Dangel

    Andreas Dangel - 2016-04-01

    Thanks for raising this issue. I tried it out with a similarly sized file (23 MB, 400,000 lines). I stopped PMD 5.4.1 after more than 30 minutes...

    I've improved the code, and the same file can now be analyzed within 30 seconds(!).

    The fix will be included in PMD 5.3.7, 5.4.2 and 5.5.0.

    Commit: https://github.com/pmd/pmd/commit/494719d8ea74102313cbbca6c69bdc6e714f43d4

     
    Last edit: Andreas Dangel 2016-04-30
  • Andreas Dangel

    Andreas Dangel - 2016-06-25
    • labels: Performance Degradation "Large XML files" XPath "Custom Rule" --> xml
    • summary: Performance degradation scanning large XML files with XPath custom rules --> [xml] Performance degradation scanning large XML files with XPath custom rules
     
