
powershell ie automation instead of webclient

Developers
ivan lim
2009-06-07
2012-11-17
  • ivan lim

    ivan lim - 2009-06-07

    2009-06-07

    on the day Tommy is cut loose .. i feel your pain man .. and the kid replacing you gives up how many homers?  same as strikeouts too.... oh youth...

    anyway on that day the game is rained out and i look at the rs error messages and trace them to a javascript error page that comes back when the webclient hits the url in the script.  In this day and age of ajax and javascript, System.Net.WebClient ain't the web scraper it once was: it fetches the raw response and never runs the page's scripts.  So like any scripter i think ... ok IE automation and powershell here we come...

    4 days later..

    1) view source in ie8 pretty prints the html ... and it is completely different from the $oIE.document.body.innerHTML that the script is processing. Look at the actual data stream you regex against, not the pretty-printed version.  I was about to compose a long ode to Tommy on paradigm shifts in regex cause all the regex stuff kept generating nonsense.  I thought perl regex had suddenly left the building..where is elvis?  And you would think with |gm showing 100's of events/methods/properties to query on the $oIE object that a simple getElementById-style lookup would make things simple... in the end i just gulped the entire html as a string and regexed against that.

    2) $str = $oIE.document.body.innerHTML would return blank, but $oIE.Visible=$True would render the correct page.  Sometimes it would work, sometimes it would fail.  I thought it was like the proxy cache time out problem that plagues programmatic access to web services....  is there a property/event for completed rendered page?  cause Visible will pop the correct url page, but the code is null string.

    3) Start-sleep sort of works.  In the console the default is Start-Sleep -s X or seconds... in a script it is apparently milliseconds one day, and the previous day it is seconds.  And one day [string] automagically converts out-host fine, the next day it makes the wrong type conversion...

    4) the entire replacement script is about 200 lines with moderate error checking.  The vs2k5 project it replaces is a 400k winform web scraper which worked fine until the target pages changed over to javascript detection and redirection to an error page .. this happened around the time ie8 was forced out so lots of changes in browser detection.  So how long will this powershell script last?  the general problem of ajax and javascript rendering is a onesie-twosie thing...like fixing perl scripts.  But then you're only looking at 200 lines of script code.
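
On the question in 2) about a property for a completely rendered page: the IE automation object exposes Busy and ReadyState (4 means complete). Below is a minimal sketch of polling them before reading innerHTML. The function name Get-RenderedHtml, the timeout and the 200 ms poll interval are assumptions for illustration, and a page that keeps pulling data via ajax after ReadyState reports complete may still need an extra wait. Spelling out -Milliseconds or -Seconds on Start-Sleep removes any doubt about units.

```powershell
# Hypothetical helper: navigate IE and return the rendered body HTML as a string.
function Get-RenderedHtml {
    param(
        [string]$Url,
        [int]$TimeoutSeconds = 30   # assumed default, tune to the target site
    )
    $oIE = New-Object -ComObject InternetExplorer.Application
    try {
        $oIE.Navigate($Url)
        $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
        # Poll until IE is no longer busy and ReadyState is 4 (complete), or we time out
        while (($oIE.Busy -or $oIE.ReadyState -ne 4) -and ((Get-Date) -lt $deadline)) {
            Start-Sleep -Milliseconds 200
        }
        return $oIE.Document.body.innerHTML
    }
    finally {
        # Always release the browser instance, even on errors
        $oIE.Quit()
    }
}
```

Usage would be something like $str = Get-RenderedHtml 'http://example.com/page', followed by the regex work against $str.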

    more later..
    ivan

     
    • ivan lim

      ivan lim - 2009-07-01

      2009-07-01

      well since i have a powershell thread here i'll dump this here...decent baseball too...

      set-content and add-content are way too slow for gig size log files ... so i google and i really wonder if bing has made them find new life...cause i'm getting real search hits instead of nonsense retailing ad clicks now..  anyway the ms support guys in the forums all fall back to .net framework stuff when this issue comes up and right now it is about 20x faster in the current running script:
      ...
      $fileIN = Get-ChildItem $args[0]
      $presentworkingdirectory = Split-Path -parent $MyInvocation.MyCommand.Definition
      $fileOUT = $presentworkingdirectory +"\" + "AllPerm.txt"
      ...

      $f = [System.IO.File]::OpenText($fileIN.FullName)
      $fileOUT
      # AppendText creates the file if it is missing and appends otherwise;
      # OpenText returns a read-only StreamReader, so it cannot be used for output
      $fout = [System.IO.File]::AppendText($fileOUT)

      while(!$f.EndOfStream)
      {
          $line = $f.ReadLine()
          $line1rule = rule1permute($line)
          #add-content $fileOUT $line1rule
          # use WriteLine instead if rule1permute does not add its own newline
          $fout.Write($line1rule)
      }
      $f.Close()
      $fout.Close()

      lots of path and type stuff but just follow the c# help examples ...streamreader/writer are standard classes and plain [System.IO.File]::OpenText does work for reading..and if you lock the file in a create and never close it you have to kill the ps session to release the file.

      i'll bench mark at the endofday run .. but right now looks like it is 20x faster..
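
A minimal sketch of the same StreamReader/StreamWriter loop wrapped in try/finally, so the file handles are released even when the per-line work throws and the ps session does not have to be killed. The function name Convert-LogFile and the -Transform scriptblock (standing in for rule1permute) are assumptions for illustration. Pass full paths, since the .NET classes resolve relative names against the process working directory rather than the PowerShell location.

```powershell
# Hypothetical helper: stream InPath line by line through Transform and append to OutPath.
function Convert-LogFile {
    param(
        [string]$InPath,
        [string]$OutPath,
        [scriptblock]$Transform = { param($line) $line }   # identity by default
    )
    $reader = New-Object System.IO.StreamReader($InPath)
    try {
        # second constructor argument $true means append to an existing file
        $writer = New-Object System.IO.StreamWriter($OutPath, $true)
        try {
            while (-not $reader.EndOfStream) {
                $line = $reader.ReadLine()
                $writer.WriteLine((& $Transform $line))
            }
        }
        finally { $writer.Close() }   # releases the lock on the output file
    }
    finally { $reader.Close() }
}
```

A call might look like Convert-LogFile $fileIN.FullName $fileOUT { param($l) rule1permute $l }.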

      on a pthread note .. the q9000 box is able to generate 120 cmd threads twice as fast as a p7350 duo on the laptop .. both vista64 sp2 builds..but with 3 job scripts running the stackfaults in the cygwin sessions are much more frequent..  The cygwin build just uses sleeps around the lock/unlock code to get some stability...

      more later..
      ivan

       
    • ivan lim

      ivan lim - 2009-07-02

      2009-07-02
      ok real world script runs 12 hours generating a 9 gig text file .. and it is a little less than 10x faster than using add-content.  So not the 20x i was eyeballing in the last note when i started the script run.  Run on a laptop w/ a 7200 drive and 50 gig free space so i assume not a lot of thrashing of the drive, and task manager only shows 50% cpu.

      All the pthread scripts run at 100% cpu but do little io.  But two jobs only did 420k combinations each on a quad9000 box.  A single pthread job runs over a million combinations in 12 hours, so generating 120 cmd threads did not get me double the thru-put, it was slightly worse than running 60 cmd threads via the pthread calls. And you have to wonder if an i7 box which is hyperthreaded to get the 4 other processors would run into the same thread scheduling thru-put barrier.

      more later..
      ivan

       
    • ivan lim

      ivan lim - 2009-07-07

      2009-07-06
      strange google.code page w/ apparently homegrown video js is crashing ie8 all the time now...i had a long note on the isbnextractor ieBookCmdlet and went to grab the url to paste and for the 4th time in a week, ie8 crashes horribly just rendering/refreshing some silly video ajax frame sitting in the google code page.  a chromeing we go?  or wait for them to notice.  This summer of code stuff is filling google with lots of bugs, and i'd say it is following the exact same track of arrogance as microsoft...  a generation of admins who think nixon is a baseball player..or michel is who??

      i'll add the isbn cmdlet later..
      ivan

       
    • ivan lim

      ivan lim - 2009-07-09

      2009-07-08

      http://code.google.com/p/isbnextractor/

      i wonder if the north korean hackers got to google videos...  cause it's been crashing for a week now..

      anyway my perlly prose is off into the ether..but i got to fix up the script a bit.
      The script itself, which calls the get-isbnInfo cmdlet, is necessary because of full path bugs/gotchas.  I tried to put the path info into the cmdlet so that a ps1 script would have simple syntax but .. i spent more time trying to fix permissions and full paths in c# than in ps .. so the ps1 is:
      # need to update _0_summaryXML.lst with
      # nop use ps1 not commandline cmdstuff:  dir /w /b c:\scripts\isbnDB > c:\scripts\isbnDB\_0_summaryXML.lst
      # which is a crude way of catching duplicates that use up the isbndb.com connection limit
      # out of cmd?

      gci c:/scripts/isbndb/*.xml | Select-Object -Property Name > c:/scripts/isbndb/_0_summaryXML.lst

      gci -Recurse $args[0] | foreach -process {$b = Get-IsbnInfo -Isbnpdfchmparse $_.FullName;
          if( $b -ne $null ) { $i = get-isbninfo -Isbndbisbnlookup $b; $i}
           else { write-host("skipping..")} }

      and the web scraper/isbnDB api xml calls and the pdfbox and isbnextractor stuff are in the cmdlet, which has a process record fragment like:
                  switch (ParameterSetName)
                  {
                      case "isbndb":
                          WebScrappers isbnWB = new WebScrappers();
                          object xmlrtn = isbnWB.findTitleSearchISBNdb(Isbndb);
                          WriteObject(xmlrtn);
                          break;
                      case "isbndbisbnlookup":
                          WebScrappers isbnWB2 = new WebScrappers();
                          if (isbnxmlExists(Isbndbisbnlookup) )
                          {
                              // then skips this entry
                              WriteObject(Isbndbisbnlookup + " already exists in c:\\scripts\\isbnDB so skip lookup");
                          }
                          else
                          {
                              object xmlrtn2 = isbnWB2.findBookXMLSearchISBNdb(Isbndbisbnlookup);

                              WriteObject(xmlrtn2);
                          }
                          break;
                      case "isbnpdfchmparse":
                          // have to have full path to the filename not ./xxx otherwise get null ref and log_parser.txt in home dir shows the error msg could not find file
                          ResultISBN isbn = new ResultISBN("", "");
                          Isbnpdfchmparse = Isbnpdfchmparse.Replace(".\\","");  // powershell prepends .\ to filenames on tab completion; Replace returns a new string, so assign it back
                          //string cwd = Assembly.GetExecutingAssembly().Location;
                          // no this only gets you the path to the powershell module assembly launched
                         
                          System.IO.FileInfo fi = new System.IO.FileInfo(Isbnpdfchmparse);
                          string cwd = Directory.GetCurrentDirectory();
                          // no this still defaults to userid home dir
                          string fullpathfilename = cwd + "\\" + Isbnpdfchmparse; //Path.GetFullPath(Isbnpdfchmparse);
                          // no this still just gets userhome dir so that fi.FullName is nonsense
                          if ( !Isbnpdfchmparse.Contains(":")  )
                          {
                              System.IO.FileInfo fi2 = new System.IO.FileInfo(fullpathfilename);
                              if (fi2.Exists)
                              {
                                  WriteObject("working on :" + fullpathfilename + " error log at ~log_isbnParser.txt");
                                  isbn = GetDocumentISBN(fullpathfilename);
                              }
                              else { WriteObject("you still have bad path resolution, so pass the full correct path to the file:" + fullpathfilename); }
                          }
                          else { isbn = GetDocumentISBN(Isbnpdfchmparse); }  // already passed a full path string
                          //this is only typestring header WriteObject(isbn.ToString());
                          WriteObject(isbn.Isbn.ToString());
                          //this is entire block read in: WriteObject(isbn.Result.ToString());
                          // so if you're looking for LCC or title or author list .. better to get the isbn, and look that up rather than
                          // parse the 10 pages looking for boiler plate
                         

                          break;
                      default:
                          break;
                  }

      which in ctp3 thankfully just needs to be put into wps\Modules\iebookcmdlet and then an import-module iebookcmdlet in my profile.

      now are all these dll's kosher?
      10/12/2006  12:20 PM           151,552 bcmail-jdk14-132.dll
      10/12/2006  12:20 PM         1,187,840 bcprov-jdk14-132.dll
      07/06/2009  06:25 PM                 0 dir.lst
      10/12/2006  12:20 PM            86,016 FontBox-0.1.0-dev.dll
      07/06/2009  06:24 PM            23,552 ieBookCmdlet.dll
      07/05/2009  07:57 PM            34,304 ieBookCmdlet.pdb
      08/10/2006  10:17 AM         9,568,256 IKVM.GNU.Classpath.dll
      08/10/2006  10:14 AM           344,064 IKVM.Runtime.dll
      10/12/2006  12:20 PM           380,928 lucene-core-2.0.0.dll
      10/12/2006  12:20 PM            81,920 lucene-demos-2.0.0.dll
      10/12/2006  12:20 PM         4,653,056 PDFBox-0.7.3.dll
      11/06/2003  05:13 PM            24,576 RelatedObjects.Storage.dll
                    12 File(s)     16,536,064 bytes

      well it took about half a baseball game to drop Teo's isbnextractor code pieces into a powershell cmdlet wrapper.  It took another baseball game to fix all the path errors and i finally let gci do it all instead of trying to load each fullpath..

      gci  very nicely passes full path names
      adding to the profile

      Import-Module ieBookCmdlet
      "use myfp to pass full path to Get-IsbnParser"
      function myfp([string] $shorttargetfilename)
      {
          $cwd = Get-Location | Select-Object -ExpandProperty Path
          $targetfilename = $cwd + "\" + $shorttargetfilename
          [System.IO.FileInfo] $fi = new-object System.IO.FileInfo($targetfilename)
          return [string]$fi.FullName
          #$f = Get-ChildItem -Path $shorttargetfilename
          #return $f.FullName
      }

      this does not work so well, so stay with the gci syntax and syntactic shift...
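
A sketch of an alternative to myfp that asks PowerShell's own path resolver instead of gluing strings together. .NET classes such as System.IO.FileInfo resolve relative names against the process working directory, which cd in PowerShell does not change, so resolving inside PowerShell first sidesteps that. The function name Get-FullPath is hypothetical, and this assumes the current location is on the file system.

```powershell
# Hypothetical helper: turn a short or .\-prefixed name into a full file system path,
# relative to the current PowerShell location. The file does not need to exist.
function Get-FullPath([string]$ShortName) {
    return $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($ShortName)
}
```

The result can then be handed to the cmdlet, for example Get-IsbnInfo -Isbnpdfchmparse (Get-FullPath 'book.pdf').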

      proofing the old notes...the box is a Q9300 which will do 1.6 million combinations in a day's run of the pthread script at 30 threads each in 2 separate jobs/sessions.  So it spawns 60 cmd threads.  If i do a single 60 thread session i get less than 1 million, if i do two 60 thread sessions i get 800k combinations .. or even slower.

      i powered up the quad6600 box to see if it was any different...and thru-put is about same given 2.4ghz vs 2.6ghz...

      looking at an i7 -920 box and the hyper-threading processors don't seem to leverage pthreads anywhere close to a full core...but i haven't done a full test.
      my p4-3.4ghz laptop runs a single thread as fast as the q9300 single thread sessions..

      and of course sleep lock do unlock sleep is hardly fancy pthread scheduling either..

      more later..
      ivan

       
    • ivan lim

      ivan lim - 2009-07-10

      2009-07-10
      i'm amazed but jobmonitor.ps1 actually works with a flaky cygwin app where the pthreads crash the cygwin stack/generate access violations etc
      and the logic is screwy too ...
      ./xx args1 args2
      do
      {
          do {}
          while ( Get-Process xx | select -Property Responding )
          ./xx args1 args2
      } while (1)

      and if the cygwin dll's are in the same dir for the app this will relaunch the cygwin app and it doesn't seem to corrupt the run env....runs all day long crashing about every 10 minutes or so on average..but it runs.

      the alternative forms
      &( $args[0] + " " + $args[1] )
      $jobprocess = [Diagnostics.Process]::Start('$args[0] + " " + $args[1]')
      do
      {
      do{} while ( !$jobprocess.HasExited)
      $jobproces = [Diagnostics.Process
      ./xx xx
      } while (1)

      has lots of fix ups .. but just ./xx args works when i thought it wouldn't  &xx or using the start has lots of args processing to debug...

      i'm amazed the first fragment works .. in win7rc x64 by the way. on a quad6600

      the remoting versions and process start to hide the cmd window and logging of course is the next version but the simple little do loop nested w/ cygwin dll's in the same directory has been running all day long now w/o corrupting the run env.  so jobcontrol creates the job directories, you copy job dirs to the clients and the clients just run ./jobmonitor.ps1

      i even have a register-wmievent version sketched out .. to poll and restart, until this do loop while fragment worked.  oddity is that while(get-process xxx  works within the while but outside of it you get an error thrown that the process doesn't exist anyway.. some odd syntactic sugar...
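
A minimal sketch of the Process.Start variant of the monitor loop, assuming the job is an executable that exits (or crashes) and simply needs relaunching. The function name Invoke-JobMonitor and the -MaxRestarts cap are assumptions for illustration; the loop in the notes above runs forever instead. The arguments are passed as a separate string, not glued onto the file name.

```powershell
# Hypothetical helper: launch a program, wait for it to exit, and relaunch it
# up to MaxRestarts times, reporting each exit code.
function Invoke-JobMonitor {
    param(
        [string]$FilePath,
        [string[]]$ArgumentList = @(),
        [int]$MaxRestarts = 3        # assumed cap so the sketch terminates
    )
    $launches = 0
    while ($launches -lt $MaxRestarts) {
        # Start(fileName, arguments) takes the arguments as one string
        $p = [Diagnostics.Process]::Start($FilePath, ($ArgumentList -join ' '))
        $launches++
        $p.WaitForExit()             # blocks instead of spinning in an empty do{} loop
        "run $launches exited with code $($p.ExitCode)"
    }
}
```

Waiting on the process object avoids the empty polling loop and the Get-Process lookup by name.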

      more later..
      ivan


       
    • ivan lim

      ivan lim - 2009-07-11

      2009-07-11
      trying to get persistent data state in a c# cmdlet vs a simple ps1 script:

      lookupisbn.ps1
      # need to update _0_summaryXML.lst with
      # nope use ps1 not commandline cmdstuff:  dir /w /b c:\scripts\isbnDB > c:\scripts\isbnDB\_0_summaryXML.lst
      # which is a crude way of catching duplicates that use up the isbndb.com connection limit

      gci c:/scripts/isbndb/*.xml | Select-Object -Property Name > c:/scripts/isbndb/_0_summaryXML.lst
      # Import-Module ieBookCmdlet is already in my profile.ps1

      gci -Recurse $args[0] | foreach -process {if (( $_.FullName -match "pdf$") -or ($_.FullName -match "chm$") )
          { $blast = $b; $b = Get-IsbnInfo -Isbnpdfchmparse $_.FullName; write-host "$b found..working..";
          if( $b -ne $blast ) { $i = get-isbninfo -Isbndbisbnlookup $b;  $i}
           else { write-host("skipping..")} }
          }
      # -ne $null is not the same as a thrown object ref not an instance of an object
      # or in the pipe instead of do  if block |where-object { ( $_.FullName -match "pdf$") -or ($_.FullName -match "chm$")  } |
      # bugs 1   java.lang.Throwable: Warning: You did not close the PDF Document
      #        at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:419)
      # causes $b to be the same pdf isbn and you end up probing the same number on isbnDB.com .. so 500 probes but only 140 actual isbns
      # bugs 2 adding close does nothing to error
      ===================

      the simple $blast = $b .. then if $b -ne $blast works
      The problem:  sometimes get-isbninfo -isbnpdfchmparse does not return or find an isbn number.
      pdfbox apparently does not close the document ... and $b is not null, it is the isbn from the last run.. which then passes to the next block, which does a duplicate isbndb.com lookup eating up an extra connection against my 500 limit....

      i spent a day thinking how to do this in the c# ieBookCmdlet: reading the file, the database, looping back an error code.  the doc.Close() does not get past the java.lang.Throwable error.  This is the pdfbox-0.7.3 release from sourceforge, not the latest from apache incubator..  And this error looks to be just spurious...

      cause the $blast = $b ... then $b = .. then if $b -ne $blast ...does it all in a simple script wrapper.
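
A sketch of a slightly stronger version of the $blast trick: clear the variable before each call so a failed parse cannot leave the previous isbn behind, and remember every isbn seen in the run, not only the last one. Get-FakeIsbn is a hypothetical stand-in for Get-IsbnInfo -Isbnpdfchmparse (it pulls leading digits from the name and throws otherwise), and Get-NewIsbns is likewise an illustrative name.

```powershell
# Hypothetical stand-in for the real cmdlet: throws when no ISBN can be found
function Get-FakeIsbn([string]$Name) {
    if ($Name -match '^(\d+)') { return $Matches[1] }
    throw "no isbn in $Name"
}

# Emits each ISBN once per run; failures and repeats are skipped
function Get-NewIsbns([string[]]$Names) {
    $seen = @{}
    foreach ($n in $Names) {
        $b = $null                       # clear the previous result first
        try { $b = Get-FakeIsbn $n }
        catch { Write-Warning "skipping ${n}: $_" }
        if ($b -and -not $seen.ContainsKey($b)) {
            $seen[$b] = $true
            $b                           # only this value goes on to the isbndb lookup
        }
    }
}
```

With names such as '111.pdf', 'bad.pdf', '111b.chm', '222.pdf' only the first and last would go on to a lookup.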

      design wise it would be better to input into the mysql table here, and compare from the same table.  right now i do a batch lookup daily against the library directories on the drive (no longer dvd media changers which are way too slow) and write out an xml file.  Then later do a batch load of the xml files into mysql tables.  And that design came about because there was just a manual heuristic match of the title with the isbn.  The idea of extracting the isbn from the pdf/chm itself comes from Teo's isbnextractor project on google project hosting.  These croatian/polish guys think differently...

      i'm reminded of a friend from the old berkeley days .. his major was library science and he was one of the weirdest guys i knew...and this was a class of engineering types.  Why dusty books when you could have fancy computers?   30 years later i crack open an ebook and i'm dependent on an isbn number, lccn, dewey decimal catalog system to get any order .. and all that is created manually....

      gee whiz... more later..
      ivan

       
    • ivan lim

      ivan lim - 2009-07-15

      2009-07-14  umm all star game...

      loadxml2mysql6.ps1

      # gci<dirtarget or xmlfile> | foreach -process { ./loadxml2mysql6.ps1 $_.FullName}
      # 2009-07-13
      #        /// mysql6 has a load xml cmd .. but 6 has been pulled from GA by the Oracle/Sun guys in early 2009 w/ a maybe late 2009 GA or not
      #        /// mariaDB still has a 6.0 track but doesn't look to be very active at this time

      $xmlfile2load = $args[0]

      [void][system.reflection.Assembly]::LoadWithPartialName("MySql.Data")

      # Open Connection
      #$connStr = "server=127.0.0.1;port=3306;uid=root;pwd=password;database=test;Pooling=False"
      $connStr = "server=127.0.0.1;port=3306;uid=<putinyouruserid>;pwd=<putinyourpwd>;database=ivanschanger;Pooling=False"
      $conn = New-Object MySql.Data.MySqlClient.MySqlConnection($connStr)
      $conn.Open()

      # Create a MySqlCommand and MySqlDataAdapter object
      $sql = "LOAD XML INFILE '"+ "$xmlfile2load" + "' REPLACE INTO TABLE BOOKISBN ROWS IDENTIFIED BY '<BookData>'"
      #$sql = "show tables;"
      # grant file is a global permissions not just on a single database
      # for security on a box the server can only write to installation locations not willy nilly anywhere
      # on the file system .. which sort of makes sense... so a ps scripting solution?  copy the xml to the
      # s:\mysql6\data\data\ivanschanger location .. do the load xml infile statement and then delete the file...yukkiee...
      #$args[0]
      #"copy-item appears to be broken"
      #copy-item $xmlfile2load -destination "s:/mysql6/data/data/ivanschanger/" -recurse -force
      # a literal string in mysql uses \ as the escape char, which is what gci passes in a full name, and a ' in the name ends the sql string, so you have to \\ it or null it ..
      # i null it out of habit but here you're actually trying to load that file from the file system so it needs escaping
      $xmlfile2load = $xmlfile2load.Replace("\","/")
      $xmlfile2load = $xmlfile2load.Replace("'","\'")
      $xmlfile2load = $xmlfile2load.Replace("``","\``")  # i don't like backtick cause it is a ps escape
      # so far in 2300 xml files there are still about 20 bad file names  trademark etc  and of course this
      # is not unicode portable at all
      $sql = "LOAD XML INFILE '" + $xmlfile2load + "' REPLACE INTO TABLE BOOKISBN ROWS IDENTIFIED BY '<BookData>'"
      $sql
      $cmd = New-Object MySql.Data.MySqlClient.MySqlCommand($sql, $conn)
      $cmd.ExecuteNonQuery()

      $conn.Close()
      $cmd.Connection.close()

      ==================================

      i had to manually create the bookisbn table with the property cols .. some of the examples show row xml syntax creating the columns .. and i assume the mssql xml import would do the columns for me too later..  and since i run this script all the time the syntax is "replace into"

      gotchas:
      1) have to grant global permissions for FILE to the userid, otherwise can't load infile
      2) i thought the linux security settings on writing reading would apply to just the install mysql location, but after a day of banging on it ... it will read from anywhere.  Like it followed the linux security setting for the first few attempts then broke down and allowed reads from anywhere which is why the code fragments for copy-item are commented out
      3) copy-item completely broken on my box .. some profile setting or namespace clash i think.
      got one -confirm dialog and then everything after that fried.

      The script source mentions the problem with mysql6 .. there is a load xml script for mysql5.1 ga which is a lot more work and since i am using mysql6.011alpha i'm happy to use the one-liner.  I have a powershell cmdlet to do the same thing and might use that to attack the unicode problem with file names .. the powershell script does 99% of what i want for now.
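
A small sketch that pulls the path escaping from the script into one function so it can be checked on its own against the bad file names. The function names ConvertTo-MySqlPathLiteral and New-LoadXmlStatement are hypothetical. It covers only the backslash and single quote cases from the script above, and it is not a general defence against hostile input.

```powershell
# Hypothetical helper: make a Windows path safe to embed in a single-quoted MySQL string
function ConvertTo-MySqlPathLiteral([string]$Path) {
    $p = $Path.Replace('\', '/')     # forward slashes, so MySQL does not read \ as an escape
    $p = $p.Replace("'", "\'")       # escape single quotes so they do not end the SQL literal
    return $p
}

# Builds the same one-line statement used in loadxml2mysql6.ps1
function New-LoadXmlStatement([string]$Path) {
    "LOAD XML INFILE '" + (ConvertTo-MySqlPathLiteral $Path) +
        "' REPLACE INTO TABLE BOOKISBN ROWS IDENTIFIED BY '<BookData>'"
}
```

Piping the file names from gci through New-LoadXmlStatement and eyeballing the output is a cheap way to spot the remaining bad names before they reach the server.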

      more later..
      ivan

       
    • ivan lim

      ivan lim - 2009-07-18

      2009-07-18

      added this fragment to the iebookCmdlet and a major pain:
                      case "parseisbnchmAndLookupIsbnDB":
                          // having to separate fn's made sense with ps code of
                          /*
                           * gci -Recurse $args[0] | foreach -process {if (( $_.FullName -match "pdf$") -or ($_.FullName -match "chm$") )
          { $blast = $b; $b = Get-IsbnInfo -Isbnpdfchmparse $_.FullName; write-host $b + "found..working..";
          if( $b -ne $blast ) { $i = get-isbninfo -Isbndbisbnlookup $b;  $i}
           else { write-host("skipping..")} }
          }
              very cute but inflexible workaround from finding full paths which really needs to be solved within the cmdlet rather then depending
                           * on the ps1 script in conditioning the paths correctly. And the duplication problem is not minor
                           * so this is a refactor rewrite of the 3 separate functions of
                           * 1) find parse isbn out of a pdf/chm file ala Teo's isbnextractor code
                           * 2) test against database and the xml file dump location of c:\scripts\isbndb .. which can grow to 1000's of files
                           * 3) if not in the database text dir or mysql then do a isbndb lookup
                           * 4) write out the xml to the c:\scripts\isbndb dir, load into the database, copy to the targetload location with
                           * the naming convention of the input title instead of the corrected isbn derived title, copy a 10 page dump to the same location
                           * for future indexing and clips .. although gds indexes pdf's sort of..
                           * 5) idea is to put notes into the xml file at the same location ala irex/iliad's manifest file.  calibreDB location is
                           * dir based and authors is not really my prime dir choice.  but the ieBook app in VMC will be doing a mysql database of
                           * the metadata and preview/reader from the dir locations  the zen folder idea in gds..
                           */
                          //1 determine if file or directory so that you don't do stuff like gci -Recurse <> | foreach -process { dah dah dah
                          // make local to loop to clear the value in the loop .. ResultISBN isbnall = new ResultISBN("", "", "","");
                          System.IO.FileInfo fiFullPathall = new System.IO.FileInfo(ParseisbnchmAndLookupIsbnDB);
                          DirectoryInfo targetdir = new DirectoryInfo(ParseisbnchmAndLookupIsbnDB);
                          if (targetdir.Exists)
                          {
                              WriteObject("found dir instead of single input file " + ParseisbnchmAndLookupIsbnDB.ToString() + " so recursive walk of dir " );
                              //return; // it looks like the dll is corrupted with a new obj with in a class cyclic ref
                              foreach (FileInfo fi in targetdir.GetFiles())
                              {
                                  if ( (fi.Extension == ".pdf") || (fi.Extension == ".PDF") || (fi.Extension == ".chm") || (fi.Extension == ".CHM") )
                                  {
                                      WebScrappers isbnWBall = new WebScrappers();
                                      isbnExtractor ieo2 = new isbnExtractor();
                                      object xmlrtnall = null;
                                      ResultISBN isbnall = new ResultISBN("", "");
                                      WriteObject("working on " + fi.FullName);
                                      try
                                      {
                                          //isbnall = ieo2.GetDocumentISBN(fi.FullName.ToString());
                                          isbnall = GetDocumentISBN(fi.FullName);
                                          //WriteObject("getdoc->parseISBNwithPDFBox call throws null ref when it can't find an isbn");
                                      }
                                      catch (Exception e) { WriteObject(e.Message.ToString()); isbnall.Isbn = "" ; }
                                      //if (isbnall.Isbn.Length < 1)  // throws nullrefexception
                                      if (string.IsNullOrEmpty(isbnall.Isbn))  // the catch above sets "" so test for empty as well as null
                                      {
                                          WriteObject("no isbn found");
                                      }
                                      else
                                      {
                                          WriteObject("found " + isbnall.Isbn.ToString());
                                          if (isbnxmlExists(isbnall.Isbn.ToString()))
                                          {
                                              // then skips this entry
                                              WriteObject(isbnall.Isbn + " already exists in c:\\scripts\\isbnDB or it exists in mysql table so skip lookup");
                                              // but i want to read that xml file and copy it to the location
                                          }
                                          else
                                          {
                                              xmlrtnall = isbnWBall.findBookXMLSearchISBNdb(isbnall.Isbn.ToString());
                                              //System.IO.FileInfo fi = new System.IO.FileInfo(f);
                                              string fipathdir = fi.DirectoryName; // Isbnpdfchmparse.Substring(pathpart + 1);
                                              string finameTitle = fi.Name; // Isbnpdfchmparse.Substring(Isbnpdfchmparse.Length - pathpart);
                                              finameTitle = finameTitle.Replace(" ", "_");
                                              finameTitle = finameTitle.Replace("<", "");
                                              finameTitle = finameTitle.Replace(">", "");
                                              finameTitle = finameTitle.Replace("/", "");
                                              finameTitle = finameTitle.Replace(":", "");
                                              finameTitle = finameTitle.Replace("\"", "");
                                              string finameTitleOnly;
                                              if (finameTitle.Length > 4)
                                              {
                                                  finameTitleOnly = finameTitle.Substring(0, finameTitle.Length - 4);  // chop off the file extension a 3 char ext and dot
                                              }
                                              else { finameTitleOnly = finameTitle; }
                                              XmlWriterSettings settings = new XmlWriterSettings();
                                              if (isbnall.Result == null)
                                              {
                                                  WriteObject("no result page for isbn pdf/chm");
                                              }
                                              else
                                              {
                                                  string file10page = fipathdir + "\\" + finameTitleOnly + "_[" + isbnall.Isbn.ToString() + "]_10pgs.txt";
                                                  FileInfo t = new FileInfo(file10page);
                                                  StreamWriter Tx = t.CreateText();
                                                  Tx.Write(isbnall.Result.ToString());
                                                  Tx.Close();
                                             
                                                  string xmlfile = fipathdir + "\\" + finameTitleOnly + "_[" + isbnall.Isbn.ToString() + "].xml";
                                                  XmlWriter tmpxml = XmlWriter.Create(xmlfile, settings);
                                                  //tmpxml.WriteRaw(isbn.Isbn.ToString());  // does writeraw add it's own xml header?
                                                  tmpxml.WriteRaw(xmlrtnall.ToString());  // does writeraw add it's own xml header?

                                                  // want to write this same file to the location of the pdf/chm
                                                  tmpxml.Close();
                                              }
                                          }
                                      }
                                  }

                              }

                          }
                          else
                          {
                              WriteObject("did not find a directory so working on a file " + ParseisbnchmAndLookupIsbnDB.ToString());

                          }

                          break;

      =============
      problems:
      1) when a ps1 script calls get-isbn and the cmdlet throws a nullrefexception, the gci wrapper just moves on to the next file.  When you try to do everything within the cmdlet itself, any single unhandled error stops the entire run.
      2) so you fix up the errors one by one .. and bugs are everywhere: found, not found, error in title, error in the isbndb xml returned etc etc .. a mountain of bugs to track down inside the cmdlet, where the simple ps1 gci wrapper just skips over them.
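
A minimal sketch of how the cmdlet could get the same "skip the bad file and keep going" behaviour the gci wrapper gives for free: wrap the per-file work in its own try/catch and collect the failures. The names PerFileRunner and RunAll are made up for illustration and are not in ieBookCmdlet; inside a real cmdlet the catch would probably report through WriteWarning or WriteError rather than a list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of ieBookCmdlet.
public static class PerFileRunner
{
    // Runs 'work' on every path. An exception on one path is recorded
    // and the loop moves on to the next path instead of ending the run.
    // Returns one "path: message" string per failed path.
    public static List<string> RunAll(IEnumerable<string> paths, Action<string> work)
    {
        List<string> failures = new List<string>();
        foreach (string path in paths)
        {
            try
            {
                work(path);
            }
            catch (Exception e)
            {
                failures.Add(path + ": " + e.Message);
            }
        }
        return failures;
    }
}
```

The trade-off is the same as with the ps1 wrapper: the run finishes, but the failures still have to be read afterwards.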

      lookupisbn.ps1 ================
      Record-Session
      $cmdlinelogger = Get-Date -f "yyyyMMdd@HHmm"
      $lookuplog = "c:/prod/lookupIsbn_" + $cmdlinelogger + ".log"
      $lookuplogcontent = $cmdlinelogger + ":./lookupisbn.ps1 " + "$args"
      add-content $lookuplog $lookuplogcontent
      # need to update _0_summaryXML.lst with
      # nop use ps1 not commandline cmdstuff:  dir /w /b c:\scripts\isbnDB > c:\scripts\isbnDB\_0_summaryXML.lst
      # which is a crude way of catching duplicates that use up the isbndb.com connection limit

      gci c:/scripts/isbndb/*.xml | Select-Object -Property Name > c:/scripts/isbndb/_0_summaryXML.lst
      # Import-Module ieBookCmdlet is already in my profile.ps1

      gci -Recurse $args[0] | foreach -process {if (( $_.FullName -match "\.pdf$") -or ($_.FullName -match "\.chm$") )
          { $blast = $b; $b = Get-Isbn -Isbnpdfchmparse $_.FullName; write-host "$b found..working..";
          if( $b -ne $blast ) { $i = get-isbn -Isbndbisbnlookup $b;  $i}
           else { write-host "skipping.." } }
          }
      # -ne $null is not the same as a thrown object ref not an instance of an object
      # or in the pipe instead of do  if block |where-object { ( $_.FullName -match "pdf$") -or ($_.FullName -match "chm$")  } |
      # bugs 1   java.lang.Throwable: Warning: You did not close the PDF Document
      #        at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:419)
      # causes $b to be the same pdf isbn and you end up probing the same number on isbnDB.com .. so 500 probes but only 140 actual isbns
      # bugs 2 adding close does nothing to error
      Stop-Transcript

      =============added log, transcript running in profile, changed name from Get-ISBNINFO to Get-ISBN, but otherwise it runs much better doing ./lookupisbn.ps1 <targetdir>
      than a single Get-ISBN <targetdir> which can't handle the simplest error.

      next problem is debugging .. and dragged out the old testframework from mediachangercmdlet which uses a winform to create a runspace so you can step thru cmdlet C# code..
      namespace pcadmin.MediaChanger.PowerShell.Commands.ic
      {
          public partial class TestFrameworkPS : Form
          {
              private TextBox textBox1;
              private Label label1;
         
              public TestFrameworkPS()
              {
                  InitializeComponent();
                  //textBox1.Text = Runscript("add-pssnapin ipcmdlet");
                  // i suppose i should move formatting to the script rather than the object ..
                  // '-{0}-{1}-{2}-{3}-{4}-' -f get-ip -rdraw395 1
                  // remember add-pssnapin loads the dll from original location which your post-build script needs to
                  // copy the dll to so this assumes you already did a installutil x.dll from that location so the project
                  // ref to the test dll needs to be the installation location in 1 out of 10 tests .. ie the line
                  // add-pssnapin pcadmin.MediaChanger.PowerShell.Commands.ic will load the old dll
                  //textBox1.Text = "testset for 395  \n" + Runscript("add-pssnapin ipcmdlet; for($i=0;$i -lt 20; $i++){get-ip -rdraw395 1}");
                  //textBox1.Text = textBox1.Text + "testset for 56on5";
                  //textBox1.Text = textBox1.Text + "testset for 56on5" + Runscript("add-pssnapin ipcmdlet; for($i=0;$i -lt 20; $i++){get-ip -rdraw565 1}");
                  //works textBox1.Text = Runscript("add-pssnapin pcadmin.MediaChanger.PowerShell.Commands.ic; get-mc -Slots 1");
                  //textBox1.Text = Runscript("add-pssnapin pcadmin.MediaChanger.PowerShell.Commands.ic; get-psdrive; get-psprovider; new-psdrive -name c0s0 -root mcroot -psprovider MediaChanger; get-psdrive");
                  //textBox1.Text = Runscript("add-pssnapin pcadmin.MediaChanger.PowerShell.Commands.ic;get-MC -TestRemotingObject 1; get-MC -ShowAllSlots 1");
                  textBox1.Text = Runscript("Import-Module ieBookCmdlet; Get-ISBN -ParseisbnchmAndLookupIsbnDB v:/booksDVDlib3/books/AddisonWesley");
              }
              // on a default x64 box vista installation of PS where there are two ps shells installed call to runspace calls x64 version
              // and if you install the cmd stuff from the x86 version it doesn't see it..
              // Jean-Paul Mikkers codeproject howtorunpowershell
              private string Runscript(string scriptText)
              {
                  // using blocks close the runspace and pipeline even if Invoke() throws
                  Collection<PSObject> results;
                  using (Runspace runspace = RunspaceFactory.CreateRunspace())
                  {
                      runspace.Open();
                      using (Pipeline pipeline = runspace.CreatePipeline())
                      {
                          pipeline.Commands.AddScript(scriptText);
                          pipeline.Commands.Add("Out-String");
                          results = pipeline.Invoke();
                      }
                  }
                  StringBuilder stringBuilder = new StringBuilder();
                  foreach (PSObject obj in results)
                  {
                      stringBuilder.AppendLine(obj.ToString());
                  }
                  return stringBuilder.ToString();
              }

              static class Program
              {
                  /// <summary>
                  /// The main entry point for the application.
                  /// </summary>
                  [STAThread]
                  static void Main()
                  {
                      Application.EnableVisualStyles();
                      Application.SetCompatibleTextRenderingDefault(false);
                      Application.Run(new TestFrameworkPS());
                  }
              }

              private void InitializeComponent()
              {
                  this.textBox1 = new System.Windows.Forms.TextBox();
                  this.label1 = new System.Windows.Forms.Label();
                  this.SuspendLayout();
                  //
                  // textBox1
                  //
                  this.textBox1.Location = new System.Drawing.Point(44, 23);
                  this.textBox1.Multiline = true;
                  this.textBox1.Name = "textBox1";
                  this.textBox1.ScrollBars = System.Windows.Forms.ScrollBars.Both;
                  this.textBox1.Size = new System.Drawing.Size(817, 274);
                  this.textBox1.TabIndex = 0;
                  //
                  // label1
                  //
                  this.label1.AutoSize = true;
                  this.label1.Location = new System.Drawing.Point(25, 1);
                  this.label1.Name = "label1";
                  this.label1.Size = new System.Drawing.Size(244, 17);
                  this.label1.TabIndex = 1;
                  this.label1.Text = "TestFramework mediachanger cmdlet";
                  //
                  // TestFrameworkPS
                  //
                  this.ClientSize = new System.Drawing.Size(890, 319);
                  this.Controls.Add(this.label1);
                  this.Controls.Add(this.textBox1);
                  this.Name = "TestFrameworkPS";
                  this.ResumeLayout(false);
                  this.PerformLayout();

              }
          }
      }

      winform runspace allows the debugger in vs2k8ee to step thru the errors whereas plain ps1 code just throws the error in the cmdlet call.

      finally corrupted the dll class by creating a new object within the same object, trying to fix object nullref errors by newing everything in sight...  And when ps loads the module there is no error, but call the cmdlet and powershell completely bombs out.

      so half a baseball game to drop Teo's isbnextractor code into a powershell cmdlet.
      2 games to fix the path errors, and finally gave up and did the gci wrapper, which works perfectly fine but generates long transcript files to look at.

      and 2 Mets games to try to put the gci walk within the cmdlet itself and fix errors, for the sole simple syntax of get-isbn <targetdir> instead of the lookupisbn.ps1 script.

      more later..
      ivan

       
    • ivan lim

      ivan lim - 2009-07-19

      2009-07-19

      added the subdir tree walker from the http://msdn.microsoft.com/en-us/library/bb513869.aspx

      testframework needs to copy ieBookCmdlet.pdb and the dll to ..\modules\ieBookCmdlet in order for debug stepping to work.... it worked for half a day, then the pdb fell out of memory and suddenly i got all sorts of strange errors trying to step thru the code .. the runspace does its import-module from that location and the pdb has to be visible there i guess..

      lastly, running about 3000 items thru the script w/o throwing an object nullref, and that is done by these horrible try catch blocks and if else if else if else, which makes the code totally unreadable.. and even worse w/o indenting, so i'll push up a copy to cvs

      this block is only for docs if i crash and burn:

      ============iebookcmdlet.cs fragment in the processrecord switch statement:

                      case "parseisbnchmAndLookupIsbnDB":
                          // having to separate fn's made sense with ps code of
                          /*
                           * gci -Recurse $args[0] | foreach -process {if (( $_.FullName -match "pdf$") -or ($_.FullName -match "chm$") )
          { $blast = $b; $b = Get-IsbnInfo -Isbnpdfchmparse $_.FullName; write-host $b + "found..working..";
          if( $b -ne $blast ) { $i = get-isbninfo -Isbndbisbnlookup $b;  $i}
           else { write-host("skipping..")} }
          }
              very cute but inflexible workaround from finding full paths which really needs to be solved within the cmdlet rather than depending
                           * on the ps1 script in conditioning the paths correctly. And the duplication problem is not minor
                           * so this is a refactor rewrite of the 3 separate functions of
                           * 1) find parse isbn out of a pdf/chm file ala Teo's isbnextractor code
                           * 2) test against database and the xml file dump location of c:\scripts\isbndb .. which can grow to 1000's of files
                           * 3) if not in the database text dir or mysql then do a isbndb lookup
                           * 4) write out the xml to the c:\scripts\isbndb dir, load into the database, copy to the targetload location with
                           * the naming convention of the input title instead of the corrected isbn derived title, copy a 10 page dump to the same location
                           * for future indexing and clips .. although gds indexes pdf's sort of..
                           * 5) idea is to put notes into the xml file at the same location ala irex/iliad's manifest file.  calibreDB location is
                           * dir based and authors is not really my prime dir choice.  but the ieBook app in VMC will be doing a mysql database of
                           * the metadata and preview/reader from the dir locations  the zen folder idea in gds..
                           */
                          //1 determine if file or directory so that you don't do stuff like gci -Recurse <> | foreach -process { dah dah dah
                          // make local to loop to clear the value in the loop .. ResultISBN isbnall = new ResultISBN("", "", "","");
                          System.IO.FileInfo fiFullPathall = new System.IO.FileInfo(ParseisbnchmAndLookupIsbnDB);
                          DirectoryInfo targetdir = new DirectoryInfo(ParseisbnchmAndLookupIsbnDB);
                          // from http://msdn.microsoft.com/en-us/library/bb513869.aspx on walking dir treeeeessss, no exception handling
                          Stack<string> dirs = new Stack<string>(20);
                          int i = 0;
                          dirs.Push(targetdir.FullName);
                          if (targetdir.Exists)
                          {
                              WriteObject("found dir instead of single input file " + ParseisbnchmAndLookupIsbnDB.ToString() + " so recursive walk of dir " );
                              while (dirs.Count > 0)
                              {
                                  string currentDir = dirs.Pop();
                                  string[] subDirs;
                                  subDirs = System.IO.Directory.GetDirectories(currentDir);
                                  //foreach (string str in subDirs) { dirs.Push(str); }
                                  //DirectoryInfo subDirsStack = new DirectoryInfo(subDirs[i++]);
                                  // try stack for recursive walk  foreach (FileInfo fi in targetdir.GetFiles())
                                  //foreach(FileInfo fi in subDirsStack.GetFiles() )
                                  string[] files = null;
                                  files = System.IO.Directory.GetFiles(currentDir);
                                  foreach(string fil in files )
                                  {
                                      System.IO.FileInfo fi = new FileInfo(fil);
                                      // case-insensitive compare also catches mixed case like .Pdf
                                      if (fi.Extension.Equals(".pdf", StringComparison.OrdinalIgnoreCase) || fi.Extension.Equals(".chm", StringComparison.OrdinalIgnoreCase))
                                      {
                                          WebScrappers isbnWBall = new WebScrappers();
                                          isbnExtractor ieo2 = new isbnExtractor();
                                          object xmlrtnall = null;
                                          ResultISBN isbnall = new ResultISBN("", "");
                                          WriteObject("working on " + fi.FullName);
                                          try
                                          {
                                              //isbnall = ieo2.GetDocumentISBN(fi.FullName.ToString());
                                              isbnall = GetDocumentISBN(fi.FullName);
                                              //WriteObject("getdoc->parseISBNwithPDFBox call throws null ref when it can't find an isbn");
                                          }
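                                          catch (Exception e) { WriteObject(e.Message.ToString()); isbnall.Isbn = ""; }
                                          //if (isbnall.Isbn.Length < 1)  // throws nullrefexception
                                          // the catch above sets Isbn to "", so test for empty as well as null (IsNullOrEmpty does not throw on null)
                                          if ((isbnall == null) || string.IsNullOrEmpty(isbnall.Isbn) || (isbnall.Isbn == "nomatch"))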
                                          {
                                              WriteObject("no isbn found");
                                          }
                                          else
                                          {
                                              WriteObject("found " + isbnall.Isbn.ToString());
                                              //if (isbnxmlExists(isbnall.Isbn.ToString()))
                                              FileInfo copyxml2target = null;
                                              try
                                              {
                                                  copyxml2target = new FileInfo(isbnxmlFileExists(isbnall.Isbn));
                                              }
                                              catch (Exception e) { WriteObject(e.Message.ToString()); copyxml2target = null; }
                                              if (copyxml2target != null)
                                              {
                                                  // then skips this entry
                                                  WriteObject(isbnall.Isbn + " already exists in c:\\scripts\\isbnDB or it exists in mysql table so skip lookup and write xml/txt/index to location");
                                                  // but i want to read that xml file and copy it to the location which is a calibre/iliad oddity
                                                  try
                                                  {
                                                      copyxml2target.CopyTo(fi.DirectoryName + "\\" + copyxml2target.Name, false);
                                                  }
                                                  catch (Exception e) { WriteObject(e.Message.ToString()); }
                                              }
                                              else
                                              {
                                                  xmlrtnall = isbnWBall.findBookXMLSearchISBNdb(isbnall.Isbn.ToString());
                                                  //System.IO.FileInfo fi = new System.IO.FileInfo(f);
                                                  string fipathdir = fi.DirectoryName; // Isbnpdfchmparse.Substring(pathpart + 1);
                                                  string finameTitle = fi.Name; // Isbnpdfchmparse.Substring(Isbnpdfchmparse.Length - pathpart);
                                                  finameTitle = finameTitle.Replace(" ", "_");
                                                  finameTitle = finameTitle.Replace("<", "");
                                                  finameTitle = finameTitle.Replace(">", "");
                                                  finameTitle = finameTitle.Replace("/", "");
                                                  finameTitle = finameTitle.Replace(":", "");
                                                  finameTitle = finameTitle.Replace("\"", "");
                                                  string finameTitleOnly;
                                                  if (finameTitle.Length > 4)
                                                  {
                                                      finameTitleOnly = finameTitle.Substring(0, finameTitle.Length - 4);  // chop off the file extension a 3 char ext and dot
                                                  }
                                                  else { finameTitleOnly = finameTitle; }
                                                  XmlWriterSettings settings = new XmlWriterSettings();
                                                  // xmlrtnall can come back null, so check it before calling ToString()
                                                  if ((isbnall.Result == null) || (xmlrtnall == null) || (xmlrtnall.ToString().Contains("<ErrorMessage>") ) )
                                                  {
                                                      WriteObject("no result page for isbn pdf/chm or isbndb connection query limit reached for the day");
                                                  }
                                                  else
                                                  {
                                                      string file10page = fipathdir + "\\" + finameTitleOnly + "_[" + isbnall.Isbn.ToString() + "]_10pgs.txt";
                                                      FileInfo t = new FileInfo(file10page);
                                                      // using closes the file even if Write throws
                                                      using (StreamWriter Tx = t.CreateText())
                                                      {
                                                          Tx.Write(isbnall.Result.ToString());
                                                      }

                                                      string xmlfile = fipathdir + "\\" + finameTitleOnly + "_[" + isbnall.Isbn.ToString() + "].xml";
                                                      // want to write this same file to the location of the pdf/chm
                                                      using (XmlWriter tmpxml = XmlWriter.Create(xmlfile, settings))
                                                      {
                                                          //tmpxml.WriteRaw(isbn.Isbn.ToString());  // does writeraw add its own xml header?
                                                          tmpxml.WriteRaw(xmlrtnall.ToString());  // does writeraw add its own xml header?
                                                      }
                                                  }
                                              }
                                          }
                                      }

                                  }
                                  foreach (string str in subDirs) { dirs.Push(str); }

                              } // recursive stack walk of dirs

                          }
                          else
                          {
                              WriteObject("did not find a directory so working on a file " + ParseisbnchmAndLookupIsbnDB.ToString());

                          }

                          break;
                      case "isbndbXMLdirLoader":
      ....
      =====================
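
The title cleanup in the fragment above chops the last 4 characters on the assumption of a 3 char extension, so something like notes.html would keep a trailing dot. A possible alternative, only a sketch: let System.IO.Path strip the extension and drop anything it considers invalid in a file name. TitleCleaner and CleanTitle are hypothetical names, not part of the cmdlet; the extra stripped characters are the ones the original Replace calls remove.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of ieBookCmdlet.
public static class TitleCleaner
{
    // Characters the original Replace calls strip out.
    private const string ExtraStripped = "<>/:\"";

    // Takes a bare file name (like fi.Name) and returns it with spaces
    // turned into underscores, unwanted characters dropped, and the
    // extension removed whatever its length.
    public static string CleanTitle(string fileName)
    {
        char[] invalid = Path.GetInvalidFileNameChars();
        StringBuilder sb = new StringBuilder(fileName.Length);
        foreach (char c in fileName)
        {
            if (c == ' ')
            {
                sb.Append('_');
            }
            else if (Array.IndexOf(invalid, c) < 0 && ExtraStripped.IndexOf(c) < 0)
            {
                sb.Append(c);
            }
        }
        // Clean first, then strip the extension.
        return Path.GetFileNameWithoutExtension(sb.ToString());
    }
}
```

The set returned by GetInvalidFileNameChars depends on the platform, which is why the original characters are also listed explicitly.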

      stack dir walker works w/o the long list of try/catch for thread errors .. might have to add those later and it walks thru the library (another 2T to go...)
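
If the try/catch does have to go back in, a rough sketch of the same stack walk with the directory calls guarded might look like this. It only collects the pdf/chm paths and leaves the isbn work out; BookWalker and FindBooks are illustrative names, and skipping unreadable directories silently is an assumption about what is wanted.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of ieBookCmdlet.
public static class BookWalker
{
    // Iterative (stack based) walk of 'root' that returns every .pdf/.chm
    // path found. Directories that cannot be read, or that vanish during
    // the walk, are skipped instead of ending the walk.
    public static List<string> FindBooks(string root)
    {
        List<string> found = new List<string>();
        Stack<string> dirs = new Stack<string>();
        dirs.Push(root);

        while (dirs.Count > 0)
        {
            string current = dirs.Pop();
            string[] files;
            string[] subDirs;
            try
            {
                files = Directory.GetFiles(current);
                subDirs = Directory.GetDirectories(current);
            }
            catch (UnauthorizedAccessException) { continue; }
            catch (DirectoryNotFoundException) { continue; }

            foreach (string file in files)
            {
                string ext = Path.GetExtension(file);
                if (ext.Equals(".pdf", StringComparison.OrdinalIgnoreCase) ||
                    ext.Equals(".chm", StringComparison.OrdinalIgnoreCase))
                {
                    found.Add(file);
                }
            }
            foreach (string sub in subDirs) { dirs.Push(sub); }
        }
        return found;
    }
}
```

Returning the list first and doing the lookups in a second loop would also keep the walk and the isbn logic apart, at the cost of holding all the paths in memory.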

      need to add a logger/scheduler since isbndb.com limits to 500 queries a day.. and i just keep a window open to the account page and watch them rack up and then kill the script.  detecting <ErrorMessage>AccessKey</ErrorMessage> in the xml could also throw and stop the script later.. right now the bug writes out the error xml as the booktitle.xml and you have to look at them and delete them, then rerun the next day.
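
One way the limit handling could go, purely as a sketch: count lookups against a budget and test the returned xml for the error element before anything is written to disk. QueryBudget and IsbndbResponse are made-up names; the 500 figure and the ErrorMessage marker come from the notes above, and the exact shape of the isbndb error xml is an assumption.

```csharp
using System;

// Hypothetical helper: counts lookups against a fixed daily limit.
public class QueryBudget
{
    private readonly int limit;
    private int used;

    public QueryBudget(int limit) { this.limit = limit; }

    public int Remaining { get { return limit - used; } }

    // Returns false once the limit is spent, so the caller can stop
    // issuing lookups instead of collecting error pages.
    public bool TryUse()
    {
        if (used >= limit) { return false; }
        used++;
        return true;
    }
}

// Hypothetical helper: decides whether a response should be written out.
public static class IsbndbResponse
{
    // True when the response is missing or carries an ErrorMessage element.
    public static bool LooksLikeError(string xml)
    {
        return string.IsNullOrEmpty(xml) ||
               xml.IndexOf("<ErrorMessage", StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Usage would be roughly: create new QueryBudget(500) at the start of a run, call TryUse() before each lookup, and skip writing the xml file when LooksLikeError returns true.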

      more later..peeking at wordpress blog thingy to see if it formats code but right now has to be freebie and maybe just depend on cvs browser..
      ivan

       
