craig - 2015-08-11

One of the documents I ship is a ROADMAP.txt with various notes I keep on feature ideas for future crgrep releases. I'll post the latest (1.0.4) roadmap here rather than require a full download to view it. I'm looking for comments on the list I've compiled, and for any new resource types you'd suggest adding to it.

ROADMAP

Backlog of possible features, bugs, fixes, extensions.

Versions

within the next couple of releases:
ResultMatcher
-v invert match
filters / --include/--exclude options

Resource Types - General / New

. other archive formats: bzip, ar, cpio, dump etc
. OpenOffice ODF formats
. RSS feeds. rss/torrent grep for title, description as 'category.title'. RSS/REST only.
rss structure: http://www.landofcode.com/rss-tutorials/rss-structure.php
'foo' 'channel-category.item-category'
for <category>
. social network grep. fb/t/li
. Decompile .class files and grep result. Most likely candidate:
https://bitbucket.org/mstrobel/procyon/wiki/Java%20Decompiler

Resource Types - Web

. Http search, follow links in html pages.
. Http search, support multiple resource list(s)

Resource Types - Maven, etc

. maven pom file, follow <modules> entries for child pom files
. gradle dependencies
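
For the pom <modules> item above, a rough sketch of how the child pom paths might be collected before searching them. PomModules and childPoms are made-up names, not existing crgrep code, and the sketch assumes every <module> entry names a directory relative to the parent pom. It also picks up <module> entries inside profiles, which may or may not be wanted.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of crgrep: lists the child pom.xml paths
// named by <module> entries in a parent pom.
final class PomModules {

    // pomXml is the parent pom content; baseDir is the directory that holds it.
    static List<Path> childPoms(String pomXml, Path baseDir) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pomXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not a readable pom", e);
        }
        NodeList modules = doc.getElementsByTagName("module");
        List<Path> poms = new ArrayList<>();
        for (int i = 0; i < modules.getLength(); i++) {
            String name = modules.item(i).getTextContent().trim();
            // Assumption: each module is a directory containing its own pom.xml.
            poms.add(baseDir.resolve(name).resolve("pom.xml").normalize());
        }
        return poms;
    }
}
```

Each returned path could then be searched like any other file resource, and the same method applied again to follow nested modules.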

Resource Types - File

. file search to support URL (file://..)
. MS docs - embedded docs (Word within Excel etc)
. file type specific settings, eg
zip.password=password for decrypting zip files
(doc,ppt,xls).password=password for decrypting a list of MS file types
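
One way the grouped key form above could be read, as a sketch only. TypeSettings and expand are hypothetical names, and the parsing rule (split the bracketed list on commas, one entry per extension) is my assumption about how '(doc,ppt,xls).password' would be interpreted.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch, not crgrep's actual settings code.
final class TypeSettings {

    // Expands keys such as "(doc,ppt,xls).password" into one entry per extension.
    static Map<String, String> expand(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory reader
        }
        Map<String, String> out = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            String value = props.getProperty(key);
            int close = key.indexOf(").");
            if (key.startsWith("(") && close > 0) {
                String suffix = key.substring(close + 1); // e.g. ".password"
                for (String ext : key.substring(1, close).split(",")) {
                    out.put(ext.trim() + suffix, value);
                }
            } else {
                out.put(key, value);
            }
        }
        return out;
    }
}
```

With this, a lookup for 'xls.password' finds the value set on the grouped key, so the search code only ever asks for a single extension.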

Resource Types - Database

. Database column data containing URLs, option to follow these links.
. Database column data containing image data, apply OCR.
. database search, support multiple resource list(s)
. neo4j include relationship properties in search
. mongodb, other graph databases
. db search with >1 'tab.col'
. db query for collation rules for -i

Input / Environment / Command Line / Options

. More 'grep' features, -x (whole line match only), -s (no errors), -c (count)
from http://www.gnu.org/software/grep/manual/grep.html
. Some 'find' features (time based options, -type, -user)
. read from a file containing resource paths
. options to output to file or stream (for use as a library)
. inclusion/exclusion filters for resource lists
find like functionality
-r -include '*.pdf' 'pattern' dir/to/search
which will only look at pdf files buried inside stuff under dir/to/search
-r -l -include '*.pdf' '*' dir/to/search
lists all pdf files found under dir/to/search. Alias to 'crgrep --find '*.pdf' dir/to/search ... ??
. add user and system .crgrep?
. .crgrep file type specific passwords eg 'docx.password = 1234'
. Usage differences from grep:
- grep uses "Binary file <f> matches"
- grep doesn't show line numbers by default. Requires -n/--line-number for line numbers.
- grep has -h to suppress filename on output, makes no sense for embedded grep. Doc 'diffs to normal grep'.
. --env to verify that all drivers and external libs/deps are available
. -l listing
- means match by name
- should mean list filename if contents match. Or have 2 options? 1: match name, don't open. 2: open, match data & if match show name only
- some results include content.
. when -r specified then '*.xml' will search inside archives for file name matches.
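
For the grep-style -x and -c options listed at the top of this section, a small sketch of the intended semantics. GrepOptions is a made-up class, and it assumes java.util.regex patterns; crgrep's own pattern handling may differ.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not crgrep code: grep-style -x and -c semantics.
final class GrepOptions {

    // With wholeLine (-x) the pattern must cover the entire line;
    // otherwise a match anywhere in the line is enough.
    static boolean lineMatches(Pattern pattern, String line, boolean wholeLine) {
        return wholeLine ? pattern.matcher(line).matches()
                         : pattern.matcher(line).find();
    }

    // -c: report the number of matching lines rather than the lines themselves.
    static int countMatches(Pattern pattern, List<String> lines, boolean wholeLine) {
        int count = 0;
        for (String line : lines) {
            if (lineMatches(pattern, line, wholeLine)) {
                count++;
            }
        }
        return count;
    }
}
```

For the lines 'foo', 'foobar', 'bar' and the pattern 'foo', a plain count gives 2 and a -x count gives 1.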

Output / Results / Formatting / Display

. swing gui, see https://sourceforge.net/projects/grepgui
. chained output filters for line wrap, output formats
. formatted output: html, reports, tree view of nested results
test.war
|__foo.jar
   |__src/com/bar
      |__test.java
         --> if (matched == true) {
. context around match. Need class to capture every line tested, keep the last 5 for example.
in ResourceMatcher?
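
The 'keep the last 5 lines' idea for match context could look something like this. ContextBuffer is a hypothetical class, and where it would live (ResourceMatcher or elsewhere) is still open. It only covers lines before the match; trailing context would need a counter for lines still to print.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: remembers the most recent lines tested so they can be
// shown as leading context when a later line matches.
final class ContextBuffer {
    private final int size;
    private final Deque<String> recent = new ArrayDeque<>();

    ContextBuffer(int size) {
        this.size = size;
    }

    // Record a tested line, dropping the oldest once the buffer is full.
    void add(String line) {
        if (size <= 0) {
            return;
        }
        if (recent.size() == size) {
            recent.removeFirst();
        }
        recent.addLast(line);
    }

    // Lines currently held, oldest first.
    List<String> snapshot() {
        return new ArrayList<>(recent);
    }

    // Returns up to n lines preceding the first line that contains needle.
    static List<String> before(List<String> lines, String needle, int n) {
        ContextBuffer buffer = new ContextBuffer(n);
        for (String line : lines) {
            if (line.contains(needle)) {
                return buffer.snapshot();
            }
            buffer.add(line);
        }
        return new ArrayList<>();
    }
}
```

Memory stays bounded by the context size regardless of how large the resource is, which matters for big archives and database tables.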

Cleanup / Bugs / Issues / Refactor / Tech Debt

. move ocr/ libs into lib/
. fix -Xdebug so it does not list files that were not searched, such as binaries. Add these to trace instead.

Documentation

. Change all remaining docs to .md format
http://daringfireball.net/projects/markdown/
verify it displays correctly in bitbucket/sourceforge

Misc

. create github Pages
ref to this for ideas: http://thechangelog.com/top-ten-reasons-why-i-wont-use-your-open-source-project/

. deps uplift. Tess4j is now at 2.0 beta (Tesseract 3.03)
uplift to newer neo4j version
uplift postgresql to 9.4 etc
uplift sonatype to eclipse package
. gradle build system to replace pom.xml

Notes

FileResourceMatcher:

- interface to access all specified files
new (ResourceList, ExcludesGlob, IncludesGlob, isRecursive)

filters.addFilter(new FileFilter(excludeGlob)); 
filters.addFilter(new FileFilter(includeGlob));

for (res : resourceList.args) {
    // eg res = '**/?foo/*.tar'
    pathMatchers.add(new PathMatcher(res, isRecursive))
}

hasNext()
    p = getPathMatcher()
    return p == null ? false : p.hasNext()
getPathMatcher()
    if currPath.hasNext()
        return currPath
    return nextPathMatcher()
next()
    p -> nextPath()
      -> applyFilters
      -> next File(..)
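
A simplified Java take on the FileResourceMatcher notes above, to show the hasNext()/next() shape with include and exclude filters applied. FilteredPaths is a hypothetical name. Unlike the notes, it takes an already expanded list of candidate paths instead of one path matcher per resource argument, and it assumes the globs are tested against the file name only, as grep does for --include/--exclude.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch of the FileResourceMatcher notes; not crgrep's actual class.
final class FilteredPaths implements Iterator<Path> {
    private final Iterator<Path> candidates;
    private final PathMatcher include; // null means "include everything"
    private final PathMatcher exclude; // null means "exclude nothing"
    private Path lookahead;

    FilteredPaths(List<Path> candidates, String includeGlob, String excludeGlob) {
        this.candidates = candidates.iterator();
        this.include = glob(includeGlob);
        this.exclude = glob(excludeGlob);
    }

    private static PathMatcher glob(String pattern) {
        return pattern == null
                ? null
                : FileSystems.getDefault().getPathMatcher("glob:" + pattern);
    }

    // Assumption: filters are tested against the file name only.
    private boolean accepted(Path path) {
        Path name = path.getFileName();
        if (name == null) {
            return false;
        }
        if (exclude != null && exclude.matches(name)) {
            return false;
        }
        return include == null || include.matches(name);
    }

    @Override
    public boolean hasNext() {
        // Advance until a candidate passes the filters or the input runs out.
        while (lookahead == null && candidates.hasNext()) {
            Path path = candidates.next();
            if (accepted(path)) {
                lookahead = path;
            }
        }
        return lookahead != null;
    }

    @Override
    public Path next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        Path path = lookahead;
        lookahead = null;
        return path;
    }
}
```

Doing the filtering inside hasNext() keeps next() trivial and means callers never see a rejected path, which is roughly what the applyFilters step in the notes is after.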