cl-string-match Code

String matching algorithms implementation for Lisp

Status: Alpha

Brought to you by: anyakin

Tree [6387d5] default /

History

Read Only access

File	Date	Author	Commit
bench	2015-07-30	vityok	[69654e] profile TABAC, dump Lisp version/implementation...
contrib	2015-09-03	vityok	[fb214b] ascii: improved line reader, explicitly fail no...
site	2014-02-17	Victor	[4070dc] added info about the mirror
src	2015-08-07	vityok	[b1318f] doc updates for aho-corasick
t	2015-09-03	vityok	[fb214b] ascii: improved line reader, explicitly fail no...
.hgtags	2015-09-03	vityok	[6387d5] Added tag rel-03sep2015 for changeset 08df63e1de15
ChangeLog	2015-09-03	vityok	[08df63] updated version information
LICENSE	2013-04-11	Victor	[56de9e] added todo in the readme, license information
README.md	2015-08-29	vityok	[dca1f6] changed links, added cl-test-grid link
ascii-strings.asd	2013-12-24	Victor	[2b4d5c] some char functions, minor tests reorganization...
cl-string-match-test.asd	2015-08-09	vityok	[f5710c] all tests are now a part of the test package, P...
cl-string-match.asd	2015-09-03	vityok	[08df63] updated version information

Read Me

CL-STRING-MATCH aims at providing robust implementations of string
matching algorithms. These algorithms are also called "substring
search"
or "subsequence search" algorithms.

Currently it provides implementations of the following string matching
algorithms (see wiki for details):

Brute-force (also known as naïve algorithm)
Boyer-Moore (with mismatched character heuristic and good suffix shift)
Boyer-Moore-Horspool algorithm
Knuth-Morris-Pratt algorithm
Rabin-Karp algorithm
Shift-OR algorithm (single pattern)
Aho-Corasick algorithm (with finite set of patterns, forward
transition and fail function)
A simple backtracking regular expressions engine
re similar to that of the Lua
programming language. At the moment it significantly underperforms
compared to the CL-PPCRE.

Some string processing algorithms are also implemented:

Simple (naїve) suffix tree construction algorithm
Ukkonen's suffix tree construction algorithm

Data structures:

Prefix trie
Suffix tree

Utilities:

Testing whether a string has the given suffix or prefix (starts with
or ends with the pattern)

Some algorithms (Brute-force, Boyer-Moore-Horspool) have parametric
implementations (code templates) making it possible to declare
specific implementations for application-specific custom data types
and data structures.

This library is routinely tested on Steel Bank CL, Clozure CL,
Embeddable CL and Armed Bear CL. Chances are really high that it will
work on other platforms without problems (check its status on
CL-TEST-GRID).

Check the API Reference for more details.

Additional resources:

RATIONALE

Since the standard search function is working fine, one might ask:
why do we need a yet another implementation? Answer is simple:
advanced algorithms offer different benefits compared to the standard
implementation that is based on the brute-force algorithm.

Benchmarks
show that depending on environment and pattern of application, a
Boyer-Moore-Horspool algorithm implementation can outperform standard
search function in SBCL by almost 18 times! Check the code in the
bench folder for further details.

USAGE

CL-STRING-MATCH is supported by Quicklisp and is known by its system name:

(ql:quickload :cl-string-match)

CL-STRING-MATCH exports functions in cl-string-match package (that
is also nicknamed as sm).

Shortcut functions search given pattern pat in text txt. They are
usually much slower (because they build index structures every time
they are called) but are easier to use:

string-contains-brute pat txt — Brute-force
string-contains-bm pat txt — Boyer-Moore
string-contains-bmh pat txt — Boyer-Moore-Horspool
string-contains-kmp pat txt — Knuth-Morris-Pratt
string-contains-ac pat txt — Aho-Corasick
string-contains-rk pat txt — Rabin-Karp

A more robust approach is to use pre-calculated index data that is
processed by a pair of initialize and search functions:

initialize-bm pat and search-bm bm txt
initialize-bmh pat and search-bmh bm txt
initialize-bmh8 pat and search-bmh8 bm txt
initialize-rk pat and search-rk rk txt
initialize-kmp pat and search-kmp kmp txt
initialize-ac pat and search-ac ac txt. initialize-ac
can accept a list of patterns that are compiled into a trie.

Brute-force algorithm does not use pre-calculated data and has no
"initialize" function.

Boyer-Moore-Horspool implementation (the -BMH and -BMH8 functions)
also accepts :start2 and :end2 keywords for the "search" and
"contains" functions.

Following example looks for a given substring pat in a given line of
text txt using Boyer-Moore-Horspool algorithm implementation:

(let ((idx (initialize-bmh "abc")))
  (search-bmh idx "ababcfbgsldkj"))

Counting all matches of a given pattern in a string:

(loop with str = "____abc____abc____ab"
      with pat = "abc"
      with idx = (sm:initialize-bmh8 pat)
      with z = 0 with s = 0 while s do
       (when (setf s (sm:search-bmh8 idx str :start2 s))
     (incf z) (incf s (length pat)))
     finally (return z))

It should be noted that Boyer-Moore-Horspool (bmh) implementation
can offer an order of magnitude boost to performance compared to the
standard search function.

However, some implementations create a "jump table" that can be the
size of the alphabet (over 1M CHAR-CODE-LIMIT on implementations
supporting Unicode) and thus consume a significant chunk of
memory. There are different solutions to this problem and at the
moment a version for the ASCII strings is offered: initialize-bmh8
pat and search-bmh8 bm txt as well as string-contains-bmh8
pat txt work for strings with characters inside the 256 char code
limit.

CONTRIB

This project also contains code that is not directly invloved with the
pattern search algorithms but nevertheless might be found useful for
text handling/processing. Check the contrib folder in the repository
for more details. Currently it contains:

ascii-strings.lisp aims to provide single-byte strings
functionality for Unicode-enabled Common Lisp implementations. Another
goal is to reduce memory footprint and boost performance of the
string-processing tasks, i.e. read-line.

TODO

The project still lacks some important features and is under constant
development. Any kind of contributions or feedback are welcome.

Please take a look at the list of open issues or the Project Roadmap.

Visit project Wiki for additional information.

cl-string-match Code

String matching algorithms implementation for Lisp

Branches

Tags

Tree [6387d5] default /

History

Read Me

RATIONALE

USAGE

CONTRIB

TODO

cl-string-match Code

String matching algorithms implementation for Lisp

Branches

Tags

Tree [6387d5] default / Download Snapshot History

Read Me

RATIONALE

USAGE

CONTRIB

TODO

Tree [6387d5] default /

History