Name                     Modified     Size       Downloads / Week
phalanx-deidentify.zip   2019-09-10   102.6 MB
README.TXT               2019-09-07   4.7 kB
Totals: 2 Items                       102.6 MB   0
# README.TXT

/*
Phalanx - Deidentify
MIT License
*/
Phalanx
- In Safe Harbor Document Deidentification Mode

Thanks for downloading.



The enclosed software is already built and the resources configured.  After running CONFIG.SH, you may be able to run the program.

To build this program you need: g++ and Boost libraries newer than version 1.66
To recalculate the supporting resources (dictionaries, rules, etc.) you need: python (os, sys, re, glob, itertools), perl, shasum
Some tools for evaluation and debugging require python distutils and Cython



Phalanx is a general purpose, high performance NLP platform for processing all corpora in a folder.  Its defining features include: trivial string operation overhead through compiled resources, a minimal token footprint, a fast yet extensive rule engine, and control features that allow for adjustable processing from 1 token at a time to full batch.
It achieves a fast document processing speed through extensive initialization and high memory use.  It leaks memory at about 250MB/hour.

To run Phalanx - Deidentify
  Put all the files that you want to process into a directory and link 'Corpora' to that directory
  There must be a directory linked to 'Corpora' and that directory must have at least 1 file.
  Run the program Phalanx - Deidentify: ./Controller Deidentify_Controller_Node
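
  The directory setup above can be sketched in Python.  This is only an illustration, assuming Python 3 and a system that supports symbolic links; link_corpora is a hypothetical helper and is not shipped with Phalanx.

```python
import os


def link_corpora(source_dir, link_path="Corpora"):
    """Point the 'Corpora' link at source_dir and return how many files it holds.

    Hypothetical helper, not part of Phalanx.
    """
    source_dir = os.path.abspath(source_dir)
    if not os.path.isdir(source_dir):
        raise NotADirectoryError(source_dir)
    # The README requires the linked directory to hold at least 1 file.
    files = [name for name in os.listdir(source_dir)
             if os.path.isfile(os.path.join(source_dir, name))]
    if not files:
        raise ValueError("directory has no files: " + source_dir)
    # Replace an existing link, but never delete a real file or directory.
    if os.path.islink(link_path):
        os.unlink(link_path)
    elif os.path.exists(link_path):
        raise FileExistsError(link_path)
    os.symlink(source_dir, link_path)
    return len(files)
```

  Calling link_corpora("/path/to/notes") from the Phalanx directory would then leave 'Corpora' pointing at that directory, ready for ./Controller Deidentify_Controller_Node.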
  
To use Phalanx - Deidentify as an API
  Put all the files that you want to process into a directory and link 'Corpora' to that directory
  There must be a directory linked to 'Corpora' and that directory must have at least 1 file.
  
  C++
  void process_all_documents();
  void process_next_document();
  void reinit_nodes();
  void mark_end_of_processing();
  void sort_files_in_queue_by_size();
  long add_file_to_queue(string text_file);
  long add_directory_to_queue(string directory_file);

  Create a Controller object and then process documents.  Example:
    Controller *reader1 = new Controller(argv[1],main_log,&main_log_mark,main_log_max);
    reader1->process_all_documents();
  
  OR
  
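  Controller *reader1 = new Controller(argv[1],main_log,&main_log_mark,main_log_max);
  // 'Corpora' must hold at least 1 file, so the first document is always available
  bool next_file_available = true;
  while(next_file_available) {
    reader1->process_next_document();
    // the result of get_next_file() is treated as a boolean here
    next_file_available = reader1->get_next_file();
    if(next_file_available) {
      reader1->reinit_nodes();
    }
    else {
      reader1->mark_end_of_processing();
    }
  }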

  Python
  This is enabled through Cython.  Compiled and linked files are included.  It mirrors the C++ API:
  safe_harbor_deid()
  process_all_documents()
  reinit_nodes()
  mark_end_of_processing()
  sort_files_in_queue_by_size()
  add_file_to_queue(string)
  add_directory_to_queue(string)
  
  import safe_harbor_deid
  deid = safe_harbor_deid.safe_harbor_deid()
  deid.process_all_documents()
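
  The queue methods listed above might be combined as in the sketch below.  It is not taken from the Phalanx sources: queue_and_process is a hypothetical wrapper, and it assumes that queued paths are picked up by process_all_documents() and that the Cython bindings accept ordinary Python strings.  Neither point is documented here, so check against the bundled module before relying on it.

```python
import os


def queue_and_process(deid, paths):
    """Queue each existing path on a safe_harbor_deid-style object, then process.

    deid is any object exposing the method names listed in this README.
    Hypothetical wrapper; the call order is an assumption, not documented behavior.
    """
    queued = 0
    for path in paths:
        if os.path.isdir(path):
            deid.add_directory_to_queue(path)
        elif os.path.isfile(path):
            deid.add_file_to_queue(path)
        else:
            continue  # skip paths that do not exist
        queued += 1
    if queued:
        # Sort direction (smallest or largest first) is not documented.
        deid.sort_files_in_queue_by_size()
        deid.process_all_documents()
    return queued


# Usage, with the bundled Cython module on the import path:
#   import safe_harbor_deid
#   queue_and_process(safe_harbor_deid.safe_harbor_deid(), ["notes/a.txt"])
```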
  


Safe Harbor Deidentification Mode of Phalanx is an abridged pipeline of NLP annotators, culminating in NER annotators that write their output as text offsets.  It uses the Safe Harbor deidentification method.
To run the MIMIC2 corpus through Phalanx - Deidentify:
  Download MIMIC2
  Copy the corpus file into the directory linked to 'Corpora'
  Run Phalanx - Deidentify
  Transform the output into MIMIC2 format: perl output_mimic2.pl ../Corpora/id.text ../Workspace/PII_Output1.csv > out_mimic.csv
  Run your own evaluation or use the one included with MIMIC2 deid
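
  Because the annotators emit text offsets rather than rewritten text, a consumer has to apply the offsets itself.  The sketch below shows one way to do that.  It assumes half-open [start, end) character offsets; the actual column layout of PII_Output1.csv is not described in this README, so mask_spans is a hypothetical helper that takes plain (start, end) pairs.

```python
def mask_spans(text, spans, mask_char="*"):
    """Replace each [start, end) character span in text with mask characters."""
    chars = list(text)
    for start, end in spans:
        if not 0 <= start <= end <= len(chars):
            raise ValueError("span out of range: (%d, %d)" % (start, end))
        for i in range(start, end):
            # Keep newlines so the line structure survives masking.
            if chars[i] != "\n":
                chars[i] = mask_char
    return "".join(chars)


# Illustrative note and offsets, not real Phalanx output.
note = "Seen by Dr. Smith on 2019-09-07."
masked = mask_spans(note, [(12, 17), (21, 31)])
```

  Masking in place keeps the text length unchanged, so later offsets stay valid even when spans are applied in any order.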
  

HOW DOES THIS COMPARE TO OTHER DEIDENTIFICATION PROGRAMS?
deid is by MIT and bundled with MIMIC2
scrubber is by NIH

MIMIC2
                        Specificity   Sensitivity   Time    Memory   CPU Load
Phalanx - Deidentify    .81           .97           19s     850MB    1
deid                    .75           .97           340s    90MB     1
scrubber                ~.25          ~.99          120s    2.1GB    2

i2b2-2014 Deidentification (combined into 1 file)
                        Specificity   Sensitivity   Time    Memory   CPU Load
Phalanx - Deidentify    .90           .80           25s     750MB    1
deid                    .87           .61           647s    90MB     1
scrubber                                            120s    2.1GB    2

i2b2-2014 Deidentification (514 files)
                        Specificity   Sensitivity   Time    Memory   CPU Load
Phalanx - Deidentify    -             -             25s     250MB    1
deid                    -             -             845s    90MB     1
scrubber                                            90s     1GB      2
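
For readers reproducing the comparison, the sketch below uses the textbook definitions of sensitivity and specificity plus a simple time ratio.  This README does not say how its figures were computed, so the definitions are an assumption; the counts in the comments are made up for illustration, and only the timings come from the tables above.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real identifiers that were flagged: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-identifiers that were left alone: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


def speedup(baseline_seconds, seconds):
    """How many times faster a run is than the baseline run."""
    return baseline_seconds / seconds


# Made-up counts: 97 identifiers found, 3 missed.
example_sensitivity = sensitivity(97, 3)
# Timings from the MIMIC2 table: deid 340s, Phalanx - Deidentify 19s.
mimic2_speedup = speedup(340, 19)
```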

Source: README.TXT, updated 2019-09-07