RE: [Tsaphan-developers] Verse Diff program

Some thoughts...

Should there be a score or should it be pass/fail?  This is the Word of God,
you know.  Is anything less than perfect acceptable?

How about a visual summary showing the user's entry compared to the correct
verse?  All errors could be highlighted.  This might simplify the task of
identifying specifically which words were out of order, which words were
omitted, etc.

==========================================================================
Your entry:
All scripture is *insired* by God, * *profitable for teaching, *for
correction, for reproof*, for training in righteousness; so that the man of
God may be adequate, equipped for every good work.

Verse:
All scripture is inspired by God, and profitable for teaching, for reproof,
for correction, for training in righteousness; so that the man of God 
may be adequate, equipped for every good work.

===========================================================================

The highlights (represented by *) reveal that:
- "insired" is a spelling error.
- " " before profitable shows "and" was omitted.
- "profitable...reproof" is out of order.  "for training" is where the order
resumes correctly.

Different colored highlights can be used, too.
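
A rough sketch of how such a highlighted view might be generated, assuming a
word-level longest-common-subsequence (LCS) comparison.  The class and method
names (VerseHighlighter, highlight) are made up for illustration and are not
part of any existing Tsaphan code.  Wrong or extra words are wrapped in *, and
an omitted word is shown as "* *":

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: marks up the user's entry against the correct verse. */
final class VerseHighlighter {
    static String highlight(String attempt, String actual) {
        String[] a = attempt.trim().split("\\s+");
        String[] b = actual.trim().split("\\s+");

        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length) {
            if (j >= b.length) {
                out.add("*" + a[i] + "*");   // extra word past the end of the verse
                i++;
            } else if (a[i].equals(b[j])) {
                out.add(a[i]);               // correct word in the correct place
                i++;
                j++;
            } else if (lcs[i + 1][j + 1] == lcs[i][j]) {
                out.add("*" + a[i] + "*");   // wrong or misspelled word
                i++;
                j++;
            } else if (lcs[i][j + 1] == lcs[i][j]) {
                out.add("* *");              // a verse word was omitted here
                j++;
            } else {
                out.add("*" + a[i] + "*");   // extra or out-of-order word
                i++;
            }
        }
        return String.join(" ", out);
    }
}
```

For example, highlight("God, profitable", "God, and profitable") returns
"God, * * profitable", and highlight("is insired by God", "is inspired by God")
returns "is *insired* by God".  Words omitted at the very end of the entry are
not marked in this sketch.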

Does this help?

Nick

-----Original Message-----
From: Patrick Lacson [mailto:pa...@la...]
Sent: Wednesday, April 18, 2001 2:22 PM
To: Mike Lucas
Cc: htw-list
Subject: [Tsaphan-developers] Verse Diff program

Mike,

I'm ccing the htw list because we need everybody's feedback on this Diff
algorithm.  The basic requirements need feedback too:

1)  Compare 2 String types
2)  Allow a comparison preference level (check for punctuation,
CaPITilaZaTion, etc., i.e. how strict the match must be)
3)  Compute the percentage based on how accurate/inaccurate the
attempted verse is vs. the actual verse
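
One possible way to handle the preference level in requirement 2 is to
normalize both strings before comparing them.  This is only a sketch: the
class name VerseNormalizer and the two boolean flags are assumptions, not an
agreed design.

```java
import java.util.Locale;

/** Hypothetical helper: applies the comparison preferences before matching. */
final class VerseNormalizer {
    /**
     * @param checkCase        if false, capitalization differences are ignored
     * @param checkPunctuation if false, ASCII punctuation is dropped
     */
    static String normalize(String text, boolean checkCase, boolean checkPunctuation) {
        String result = text;
        if (!checkPunctuation) {
            // \p{Punct} covers ASCII punctuation such as , ; . and '
            result = result.replaceAll("\\p{Punct}", "");
        }
        // Collapse runs of whitespace so line wrapping does not matter.
        result = result.trim().replaceAll("\\s+", " ");
        if (!checkCase) {
            result = result.toLowerCase(Locale.ROOT);
        }
        return result;
    }

    /** True when the attempt equals the verse under the given preferences. */
    static boolean matches(String attempt, String actual,
                           boolean checkCase, boolean checkPunctuation) {
        return normalize(attempt, checkCase, checkPunctuation)
                .equals(normalize(actual, checkCase, checkPunctuation));
    }
}
```

With both flags off, "by God, profitable" and "by god profitable" compare as
equal; turning either flag on makes them differ.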

So here's a suggested test case (for 2Tim 3:16-17)
##################################################################
All Scripture is inspired by God and profitable for teaching, 
for reproof, for correction, for training in righteousness; so 
that the man of God may be adequate, equipped for every good work. 
##################################################################

Attempted verse:
----------------
All scripture is insired by God, profitable for teaching, for
correction, for reproof, for training in righteousness; so that the
man of God may be adequate, equipped for every good work.

RESPONSE:
----------
a)  Misspelled -- "insired"
b)  Incorrect -- "God, profitable for teaching, for correction, for
reproof, "
c)  Score is 70%

My basic approach would be to use 2 arrays and tokenize the 2 strings
into the arrays:

actual_arr[0] = "All";
actual_arr[1] = "Scripture";
actual_arr[2] = "is";
actual_arr[3] = "inspired";
actual_arr[4] = "by";
...

attempt_arr[0] = "All";
attempt_arr[1] = "scripture";  // differs only in capitalization
attempt_arr[2] = "is";
attempt_arr[3] = "insired";  // red flag here for misspelled
...

// continue to process

Compare the 2 arrays (attempted/actual) for word matches, misspelled
words, and punctuation marks.  So familiarize yourself with the
StringTokenizer class and the Diff algo in the jcore.utils.Diff
package.  This may not be the best way to do this, but it at least
gives some easy answers right off the bat regarding the accuracy of
their verse attempt: missing words/punctuation (basically any token)
and non-matching words.
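
A minimal sketch of the tokenizing step, using java.util.StringTokenizer as
suggested above.  The class VerseTokens, the use of Levenshtein edit distance
to spot a likely misspelling, and the threshold of 2 are all assumptions made
for illustration; jcore.utils.Diff is not used here because its API isn't
described in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical helper: tokenizes a verse and flags likely misspellings. */
final class VerseTokens {
    /** Splits on whitespace; punctuation stays attached to its word. */
    static String[] tokenize(String verse) {
        StringTokenizer st = new StringTokenizer(verse);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens.toArray(new String[0]);
    }

    /** Levenshtein edit distance between two words (two-row version). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    /** Assumed rule: different, but within 2 edits, counts as a misspelling. */
    static boolean looksMisspelled(String attempt, String actual) {
        return !attempt.equals(actual) && editDistance(attempt, actual) <= 2;
    }
}
```

Here editDistance("insired", "inspired") is 1, so looksMisspelled flags it,
while "profitable" against "and" is treated as a non-matching word instead.
A fixed threshold is too lenient for very short words, so a real version
would probably scale it with word length.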

However, the algo *may* get lost after a few missing words, so we have to
make it smarter at figuring out where the remaining words are.  This is
where the real challenge of this diff algo lies: maintaining context and
pattern matching (via regular expressions??).
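
One simple way to "catch up" after missing words is a bounded lookahead: on a
mismatch, search a few tokens ahead in the actual verse for the word the user
typed, and resume from there.  The sketch below assumes that approach; the
class ResyncDiff and the window parameter are hypothetical names, and this is
a greedy heuristic rather than a full diff.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: greedy lookahead resynchronization. */
final class ResyncDiff {
    /**
     * Walks both token arrays in step.  On a mismatch, looks up to
     * `window` tokens ahead in `actual` for the current attempt token.
     * If found, the skipped tokens are reported as missing and the
     * walk resumes at the match.
     */
    static List<String> findMissing(String[] actual, String[] attempt, int window) {
        List<String> missing = new ArrayList<>();
        int i = 0; // index into actual
        int j = 0; // index into attempt
        while (i < actual.length && j < attempt.length) {
            if (actual[i].equals(attempt[j])) {
                i++;
                j++;
                continue;
            }
            int found = -1;
            for (int k = i + 1; k <= i + window && k < actual.length; k++) {
                if (actual[k].equals(attempt[j])) {
                    found = k;
                    break;
                }
            }
            if (found >= 0) {
                for (int k = i; k < found; k++) {
                    missing.add(actual[k]);
                }
                i = found; // caught up; the next iteration matches
            } else {
                i++;       // treat as a substituted word and move on
                j++;
            }
        }
        // Anything left in the actual verse was never typed.
        while (i < actual.length) {
            missing.add(actual[i++]);
        }
        return missing;
    }
}
```

For actual {"by", "God", "and", "profitable", "for"} and attempt
{"by", "God", "profitable", "for"} with a window of 3, the result is ["and"].
Repeated words such as "for" can fool a greedy lookahead into resyncing at the
wrong place, which is the usual argument for a longest-common-subsequence
style diff instead.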

So think about this approach and let me know what pros/cons you see --
shoot your ideas out, because I need as much feedback from everybody on
this as I can get.  Here are some questions I had to ask myself about
this design:

1) Does it make sense to use 2 token arrays for comparison?
2) How do we maintain context if the 2nd array is missing words, and how
do we catch up to the original?
3) Should we even take this approach?
4) Are there other systems out there with a text parser already
available that we can reuse?
5) How should we score the attempt?  By number of words, misspelled
words, missing words, etc.?
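
To make question 5 concrete, here is one possible scoring rule as a sketch.
The weights (half a word per misspelling, a full word per missing or wrong
word) and the name VerseScore are pure assumptions for discussion; this is
not the rule behind the 70% figure in the test case above.

```java
/** Hypothetical helper: turns error counts into a percentage score. */
final class VerseScore {
    /**
     * @param totalWords number of words in the actual verse
     * @param misspelled words present but misspelled
     * @param missing    words omitted or simply wrong
     * @return a percentage from 0 to 100
     */
    static int percent(int totalWords, int misspelled, int missing) {
        if (totalWords <= 0) {
            throw new IllegalArgumentException("totalWords must be positive");
        }
        // Assumed weights: a misspelling costs half a word,
        // a missing or wrong word costs a full word.
        double penalty = 0.5 * misspelled + 1.0 * missing;
        double score = 100.0 * (totalWords - penalty) / totalWords;
        return (int) Math.round(Math.max(0.0, score));
    }
}
```

With this rule, a 32-word verse with one misspelling and one missing word
scores percent(32, 1, 1) = 95.  A strict pass/fail mode would simply require
a score of 100.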

Sorry if I'm being a bit verbose, but I'm very excited that we have
another developer on the squad to help us out with this.  I'd like to
bounce all ideas off the list and get everybody involved, developer or
not, just to see if things are making sense -- I tend to think too
short-term and neglect long-term implications.

-P

Michael Lucas wrote:
> 
> Pat,
> 
>         I've done a limited amount of network programming and know nothing
> on threads.  I don't mind trying to work on it, if you are not too
> constrained on time while I learn about it.  Either project is good for
> me.
> 
>                                         - Mike

_______________________________________________
Tsaphan-developers mailing list
Tsa...@li...
http://lists.sourceforge.net/lists/listinfo/tsaphan-developers
