Extract Citation - warning if record unusable

2007-05-17
2013-05-28
  • Knut Krüger
    2007-05-17

    It would be fine to get a very big warning in the reference list if any of the records is unusable, e.g. because of a wrong record type (as submitted by PubMed).
    Right now the record is simply missing without any warning:
    Your test Database:

    {12408} wrong media type
    {1807} 
    {22222} Record not available
    {1809}

    Result:

    (2004) Animal Abstract Ocelot       details 
    (2004) Diversidad ecologica y biologica de Mexico. Secretaria de Medio Ambiente, Recursos Naturales y Pesca, Instituto Nacional de Ecologia, Comision Nacional para el Conocimiento y Uso de la Biodiversidad, 1–295

    The result for 1807 is incomplete, but that is something you could find during proofreading.
    As a minor improvement, maybe a future release could give a hint for that.

    But 12408 and 22222 are missing with no hint at all, and that is difficult to spot even when proofreading.

    Regards Knut

     
    • Hi Knut,

      I understand that this is a somewhat annoying issue. I guess that you are using refbase-0.9.0 and not anything newer taken from our version control repository? I'm asking since recognition of PubMed types has been improved quite a bit in most recent modifications and should be much less of a problem in the future.

      In addition, I've modified all citation style files so that no records will be omitted if the resource type is unrecognized.

      I agree that some (optional) debugging info would be very useful upon citation output. Actually, I've had long-standing plans to implement something like this (dating back before refbase-0.6), but until now, other features were always considered more important/pressing. In a similar fashion I'd love to see dynamic verification of field contents in the "Add/Edit Record" form so that a user could be informed about incorrect data entry or about fields that are missing but are required for a given resource type.

      http://wiki.refbase.net/index.php/Planned_feature_additions#Input_validation

      As you suggest, it would also be possible to report record ID's that have no matching entry in the database. However, I don't see how refbase could detect an incorrectly assigned resource type. refbase could flag unrecognized resource types, but it would probably not be possible to reliably detect a record that has an incorrect (but known) resource type assigned to it.

      Matthias

       
      • Knut Krüger
        2007-05-17

        > However, I don't see how refbase could detect an incorrectly assigned resource type.
        The most important thing when creating the reference list is to detect when nothing is included for a record ID, that is, when the record is empty. In that case the complete citation for a reference used in the text is missing.

        Instead of returning an empty data set, the function could return something like

        !!!!!!!!!! Record Nr. XYZ not available !!!!!!!!!!!!

        That would be enough to notice that the citation got lost from the references.

        Do you think that this could be an improvement?

        Regards Knut

         
    • Knut Krüger
      2007-05-17

      So if I understand you right, it is impossible to lose records with the new files.
      If a record is incorrectly assigned, it is possible to detect that during proofreading.
      That's an important step, because nobody can be sure that the records are complete when they are used for the first time.

      And yes, I am using 0.9.0, but I did not find anything newer.
      Is there an RSS feed available, or any other way to learn about new files?

      Knut

       
    • > The most important thing when creating the reference list is to
      > detect when nothing is included for a record ID, that is, when the
      > record is empty. In that case the complete citation for a reference
      > used in the text is missing.

      Do you really mean an *empty* record, or that no record exists for the given serial number? If the latter, then yes, this would be possible to report and it was what I was referring to when I said:

      > As you suggest, it would also be possible to report record ID's that have no matching entry in the database.

      > So if I understand you right, it is impossible
      > to lose records with the new files.

      No, I tried to say the opposite. ;-) To rephrase my earlier comment: In contrast to refbase-0.9.0, the updated citation style files will now output every record that exists in the database and which has a serial number that matches the list of extracted record serials. I.e., records of unrecognized record type are no longer omitted from the citation output, instead they are formatted similar to whole books (which is as good as we can do for records of unrecognized type).

      > If a record is incorrectly assigned, it is possible to detect that
      > during proofreading. That's an important step, because nobody can
      > be sure that the records are complete when they are used for the
      > first time.

      Sorry, I'm not sure what you mean by this.

      > And yes, I am using 0.9.0, but I did not find anything newer.

      The newest modifications have not yet been released as a downloadable package. They are currently only available in the bleeding-edge branch of our online Subversion repository:

      http://refbase.svn.sourceforge.net/viewvc/refbase/branches/bleeding-edge/

      The modified citation style files (revisions 937 and 939) should work with refbase-0.9.0 (though I haven't tested this), so you should be able to download them and use them instead of the ones that come with refbase-0.9.0:

      http://refbase.svn.sourceforge.net/viewvc/refbase/branches/bleeding-edge/cite/styles/

      > Is there an RSS feed available, or any other way to learn about
      > new files?

      I'm not aware of any RSS feeds for code additions or modifications. However, we maintain a read-only mailing list that reports any new commits:

      https://lists.sourceforge.net/lists/listinfo/refbase-commits

      And, of course, if you are familiar with version control (or are willing to learn about it), you could always check out the full refbase source from our Subversion repository and follow code changes via any SVN tool.

      http://svn.refbase.net/
      https://sourceforge.net/svn/?group_id=64647

      Matthias

       
      • CIA syndicates RSS 2.0 feeds of our commits:
        <http://cia.vc/stats/project/refbase>
        There might be other ways to get an RSS feed about changes, but CIA works fairly well.

        --Rick

         
    • > So if I understand you right, it is impossible
      > to lose records with the new files.

      Ah, I did read "possible" instead of "impossible", sorry for the confusion. So yes, you did understand me correctly. However, any serial numbers that were extracted from a text via 'extract.php' but have no matching record in the database will still get omitted. As mentioned earlier, I agree that it would be useful to inform the user about any unmatched record serials.

      Matthias

       
      • Knut Krüger
        2007-05-17

        > sorry for the confusion.

        hmm I am always thinking that I am causing confusion ....

        Are you working in Germany?  Then it is very late for you ;-)

        Knut

         
    • Knut Krüger
      2007-07-03

      Matthias wrote:
      >As mentioned earlier, I agree that it would be useful to inform the user about any unmatched record serials.

      could you tell me which function is looking for the serials or which function is extracting the records?

      I must find a solution for that, because I deleted some dups, and maybe some of them are used in a Word file.

      Regards Knut

       
    • Hi Knut,

      > could you tell me which function is looking for the serials or which function is extracting the records?

      Have a look at function 'extractFormElementsExtract()' in file 'search.php'.

      If you don't want to mess with that function, here's a workaround that lets you find all records that were cited in a Word file but aren't available in your refbase database anymore:

      1. Go to Options > Cite Options, and enable the checkbox next to "Use custom text citation format"

      2. Edit your custom text citation format so that it just reads "<:serial:>" (w/o the quotes)

      3. Paste the contents of your word file into the form of 'extract.php' and choose "Text Citation" as output style, then click the "Cite" button

      4. Copy the resulting list of serial numbers into a new text editor window

      5. Copy the contents of your word file into another text editor window and extract all serial numbers from the text via some text munging (see below)

      6. Use your text editor (or an app such as Excel etc) to sort both lists of serial numbers by number

      7. Use your text editor to remove any duplicate serial numbers from the serials list that originated from the word file's source text

      8. Use the "diff" feature of your text editor to compare both lists of serial numbers

      This should tell you those serial numbers that exist in your word file but which are missing in the database.

      Note that the "Extract citations" feature also works with user-specific cite keys (given in the 'cite_key' field). If identical cite keys are used for duplicate records, the deletion of dups shouldn't lead to missing records.

      To extract all serial numbers from the text of your word file, you could use a regular search+replace pattern similar to this one:

      search for:
      {(.+?)}.*?(?={.+?}|$)

      replace with:
      \01\n

      Note that this example assumes curly brackets as start/end delimiters (which enclose the serial numbers) and that it uses Perl-style regular expression syntax, which you may need to adapt to the regex syntax used by your text editor. Also, some more text munging (such as merging multiple runs of newlines or removing words at the beginning of a paragraph) will be required to come up with a clean list of serial numbers.

      This does all sound more complicated than it actually is and I'm sure that the described steps could be combined into a little Perl/PHP/shell script rather easily.
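
As a rough illustration of such a script, here is a minimal PHP sketch of steps 5 to 8. It assumes curly brackets as delimiters and one identifier per pair of brackets; `extractSerials()` is a hypothetical helper name, not a refbase function, and `$found` stands in for the serial list copied from the "Text Citation" output.

```php
<?php
// Sketch only: extractSerials() is a hypothetical helper, not part of refbase.

// Collect every identifier enclosed in curly brackets, e.g. "{1807}",
// remove duplicates and sort the result.
function extractSerials(string $text): array
{
    preg_match_all('/\{(.+?)\}/', $text, $matches);
    $serials = array_unique(array_map('trim', $matches[1])); // drop repeated citations
    sort($serials, SORT_NATURAL);                             // "2" sorts before "10"
    return $serials;
}

// Made-up sample text; in practice this would be the text of the Word file.
$text  = 'Ocelots {1807} occur in Mexico {1809}{22222}, see also {1807}.';
// Assumed input: the serials that refbase returned for the same text.
$found = ['1807', '1809'];

// Serials that are cited in the text but were not returned by refbase.
print_r(array_values(array_diff(extractSerials($text), $found)));
```

Using `preg_match_all()` with a capture group avoids the extra clean-up that the search+replace approach needs, since only the text between the brackets is kept.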

      Hope this helps, Matthias

       
    • Knut Krüger
      2007-07-04

      Hi Matthias
      the query string after
      'extractFormElementsExtract()' in file 'search.php' contains all records at once (maybe there is a better place, where the serials are still individual).

      What about a query for each record

      $select = mysql_query('SELECT serial FROM refs LEFT JOIN user_data ON serial = record_id AND user_id = 1 WHERE serial RLIKE "^(xxx)$"');
      $row = mysql_fetch_array($select);
      if (!$row) { /* write the serial into an array */ }

      and then add these serials at the end of the references.

      Knut

       
      • > What about a query for each record

        Issuing a separate SQL query for every single serial number could mean an awful lot of queries, potentially several hundred queries per cite action, so that's probably not a good solution for identifying missing records upon citation output.

        However, I think that two SQL queries would be sufficient. The first query would retrieve a list of serial numbers for all cited records. This list of retrieved serial numbers could then be compared with the list of serials that were extracted from the text input (similar to the manual steps outlined in my previous post). The second query could retrieve formatted citations for every cited record that exists in the database, and missing record serials could be presented in a warning message upon citation output.
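
A minimal sketch of what the first of these two queries could look like, assuming numeric serials only (cite keys would need proper string escaping instead). `buildSerialLookupQuery()` is a hypothetical helper; the table and column names (`refs`, `serial`) are taken from the query posted above.

```php
<?php
// Sketch only: buildSerialLookupQuery() is a hypothetical helper, not part of refbase.

// Build ONE SELECT statement that looks up all cited serials at once,
// instead of issuing a separate query per serial number.
function buildSerialLookupQuery(array $serials): string
{
    // Cast to int so that only numbers can end up in the SQL string.
    $ids = array_unique(array_map('intval', $serials));
    if (count($ids) === 0) {
        return 'SELECT serial FROM refs WHERE 1 = 0'; // nothing cited: match no row
    }
    return 'SELECT serial FROM refs WHERE serial IN (' . implode(', ', $ids) . ')';
}

echo buildSerialLookupQuery(['12408', '1807', '22222', '1809']), "\n";
```

The serials returned by this query are the ones that exist in the database; anything extracted from the text that is not among them is a candidate for the warning message.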

        Matthias

         
    • Knut Krüger
      2007-07-05

      Hi Matthias,
      I did a test with 3000 queries after extractFormElementsExtract:
      (OK, the output will appear at the top for the test)
      $start = microtime(true);   // wall-clock start time, in seconds
      $errorstr = "";
      for ($serial = 1; $serial < 3000; $serial++) {
         // one query per serial number
         $select = mysql_query('SELECT serial FROM refs LEFT JOIN user_data ON serial = record_id AND user_id = 1 WHERE serial RLIKE "^('.$serial.')$"');
         $row = mysql_fetch_array($query);   // should read $select; $query is never set here
         if (!$row) $errorstr .= "-".$serial;   // no row returned: remember the missing serial
      }
      $elapsed = microtime(true) - $start;   // total run time of the loop
      echo("----elapsed seconds ".$elapsed);
      echo("-----".$errorstr."-------");

      The run took 12 sec on the very first and oldest server of my provider (built 2004).
      I do not know of a paper with more than 300 records, so we are talking about 1 sec or less on a faster machine.
      I think 1 sec is a fair price for the assurance of not forgetting a record in the references.
      And the errorstr reported all missing serials. That takes much less time than comparing the text with the references.

      Knut

       
    • Knut Krüger
      2007-07-05

      I forgot to write:
      But now I am thinking about your suggestion. I need an edit-post function...

      Knut

       
    • Knut Krüger
      2007-07-05

      and it should be $row = mysql_fetch_array($select);
      There was a power blackout for three seconds while I was writing, so I posted old code.
      Three posts for one issue :-(

       
    • In my experience, manipulation of strings and lists is pretty fast in PHP, and since refbase is a multi-user database (with possibly many users accessing the database concurrently) we should try to avoid multiple SQL queries whenever possible. Since, in this case, we can easily reduce the number of queries from possibly several hundred or >1000 down to two queries per cite action, there's no question what should be done.

      As outlined earlier, refbase should generate two arrays of serial numbers (one from the source text, one from all matching records in the DB), then compare these arrays, and, upon citation output, spit out a warning message containing those serial numbers that only occurred in the source-text array.
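
The comparison itself could be sketched as follows. This is only an illustration of the idea, not the refbase implementation: `missingRecordsWarning()` is a hypothetical helper, and the wording of the message is made up.

```php
<?php
// Sketch only: missingRecordsWarning() is a hypothetical helper, not part of refbase.

// $extracted: identifiers found in the source text
// $existing:  identifiers of the matching records returned by the database
function missingRecordsWarning(array $extracted, array $existing): string
{
    // Flip the DB list so each lookup is a hash access rather than a linear scan.
    $known = array_flip(array_map('strval', $existing));
    $missing = [];
    foreach (array_unique($extracted) as $id) {
        if (!isset($known[(string) $id])) {
            $missing[] = $id; // cited in the text, but no such record in the DB
        }
    }
    if (count($missing) === 0) {
        return ''; // every cited record exists, so no warning is needed
    }
    return 'Warning: no record found for ' . implode(', ', $missing);
}

echo missingRecordsWarning(['12408', '1807', '22222', '1809'], ['12408', '1807', '1809']), "\n";
```

Because both lists are already in memory, this step costs no additional SQL queries, no matter how many records are cited.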

      I've added this to my ToDo list but I cannot promise a time schedule for its implementation.

      Matthias

       
      • Knut Krüger
        2007-07-05

        Hi Matthias,
        I just implemented the foreach query as a workaround on my local machine.
        It's quick and easy, but not the final choice, I know.
        But we must submit the paper. OK, I found the missing records, which is what I expected.

        The next step is your suggestion. It should not be very difficult, but I do not agree completely.
        On my systems, building the references takes approx. 4 times longer than the multiple queries.
        The queries will only be called when a user is building the references, and if a couple of users are building references simultaneously, most of the CPU use will come from building the reference list, not from the SQL queries.

        Knut

         
    • Knut Krüger
      2007-07-05

      By the way, I changed only cite_ascii.php.
      You could test it out if you want with Extract Citations - return as ASCII.
      Records 1..13 are deleted, so you should add one of them to the inline citation list.

      Maybe you could add this workaround for all, until there is time for the rest.

      Knut

       
    • Knut Krüger
      2007-07-05

      Got it with one query, but for now only in cite_ascii.php.
      I am sure you know the best place to add the error message to all reference sets.

      Would you like to get the file?

      function findMissing($userID)
      {
          # snip
          .... build the query like in extractFormElementsExtract
          # end snip

          $result_input = mysql_query($query);
          $row_array = array();
          # collect the serials of all records that were found in the database;
          # in_array() cannot be used on $row directly, since $row holds just one record at a time
          while ($row = mysql_fetch_array($result_input)) {
              $row_array[] = $row['serial'];
          }
          $errorstr = "!! Missing records !!: ";
          # $recordSerialsKeysArray is presumably set in the snipped part above
          foreach ($recordSerialsKeysArray as $recordSerialKey) {
              if (!in_array($recordSerialKey, $row_array)) {
                  $errorstr .= "-".$recordSerialKey;   # not found in the database
              }
          }

          return $errorstr;
      } # end function findMissing($userID)

       
      • I don't think that it's necessary to alter the 'cite_*.php' files. I've uploaded a modified version of 'search.php' to the bleeding-edge branch of the refbase SVN repository:

        http://refbase.svn.sourceforge.net/viewvc/refbase/branches/bleeding-edge/search.php?view=log

        The function 'extractFormElementsExtract()' now checks whether the extracted record identifiers actually exist in the database and reports any missing record identifiers within a warning message at the top of the citation list. Note that this works for serial numbers as well as for cite keys.

        The current revision (#976) should work fine with a refbase version from the SVN trunk. If you're using refbase-0.9.0, then just replace function 'extractFormElementsExtract()' in 'search.php' from the 0.9.0 release version with the one from the modified 'search.php' file.

        Matthias

         
    • Knut Krüger
      2007-07-05

      Nice - it's working now in HTML mode and it's a great help.

      There is a PHP warning for each record found:

      Warning: Missing argument 8 for printLinks() in C:\daten\apache\htdocs\Web\uni\refdb\search.php on line 5677

      function printLinks($showLinkTypes, $row, $showQuery, $showLinks, $wrapResults, $userID, $viewType, $orderBy)

      Knut

       
      • Hi Knut, I'm glad that the feature is working for you.

        W.r.t. the PHP warning message, please make sure that you've updated all files in 'cite/formats/' to their most recent revisions, especially file 'cite_html.php' which makes use of the 'printLinks()' function. You'll probably need the revision from the bleeding-edge branch if you'd like to run this feature without problems.

        Matthias