Name | Modified | Size | Downloads / Week |
---|---|---|---|
phalanx-deidentify.zip | 2019-09-10 | 102.6 MB | |
README.TXT | 2019-09-07 | 4.7 kB | |
Totals: 2 Items | | 102.6 MB | 0 |
# README.TXT

/* Phalanx - Deidentify   MIT License */

Phalanx - In Safe Harbor Document Deidentification Mode

Thanks for downloading.

The enclosed software is already built and the resources configured. After running CONFIG.SH, you may be able to run the program.

To build this program you need:
    g++ and boost libraries greater than version 1.66

To recalculate the supporting resources (dictionaries, rules, etc.) you need:
    python (os, sys, re, glob, itertools), perl, shasum

Some tools for evaluation and debugging require python distutils and Cython.

Phalanx is a general purpose, high performance NLP platform for processing all corpora in a folder. Its defining features include: trivial string operation overhead through compiled resources, a minimal token footprint, a fast yet extensive rule engine, and control features that allow for adjustable processing from 1 token at a time to full batch. It achieves a fast document processing speed through extensive initialization and high memory use. It leaks memory at about 250MB/hour.

To run Phalanx - Deidentify

    Put all the files that you want to process into a directory and link 'Corpora' to that directory.
    There must be a directory linked to 'Corpora' and that directory must have at least 1 file.
    Run the program Phalanx - Deidentify:
        ./Controller Deidentify_Controller_Node

To use Phalanx - Deidentify as an API

    Put all the files that you want to process into a directory and link 'Corpora' to that directory.
    There must be a directory linked to 'Corpora' and that directory must have at least 1 file.

C++

    void process_all_documents();
    void process_next_document();
    void reinit_nodes();
    void mark_end_of_processing();
    void sort_files_in_queue_by_size();
    long add_file_to_queue(string text_file);
    long add_directory_to_queue(string directory_file);

Create a Controller object and then process documents.
Example:

    Controller *reader1 = new Controller(argv[1],main_log,&main_log_mark,main_log_max);
    reader1->process_all_documents();

OR

    Controller *reader1 = new Controller(argv[1],main_log,&main_log_mark,main_log_max);
    while(next_file_available)
    {
        reader1->process_next_document();
        next_file_available = reader1->get_next_file();
        if(next_file_available)
        {
            reader1->reinit_nodes();
        }
        else
        {
            reader1->mark_end_of_processing();
        }
    }

Python

This is enabled through Cython. Compiled and linked files are included. It mirrors the C++ API.

    safe_harbor_deid()
    process_all_documents()
    reinit_nodes()
    mark_end_of_processing()
    sort_files_in_queue_by_size()
    add_file_to_queue(string)
    add_directory_to_queue(string)

    import safe_harbor_deid
    deid = safe_harbor_deid.safe_harbor_deid()
    deid.process_all_documents()

Safe Harbor Deidentification Mode of Phalanx is an abridged pipeline of NLP annotators culminating in NER annotators that write their output as text offsets. It uses the Safe Harbor deidentification method.

To run the MIMIC2 corpus through Phalanx - Deidentify:

    Download MIMIC2
    Copy the corpus file into your Corpora folder
    Run Phalanx - Deidentify
    Transform the output into MIMIC2 format:
        perl output_mimic2.pl ../Corpora/id.text ../Workspace/PII_Output1.csv > out_mimic.csv
    Run your own evaluation or use the one included with MIMIC2 deid

HOW DOES THIS COMPARE TO OTHER DEIDENTIFICATION PROGRAMS

deid is by MIT and bundled with MIMIC2
scrubber is by NIH

MIMIC2
                        Specificity  Sensitivity  Time  Memory  CPU Load
Phalanx - Deidentify    .81          .97          19s   850MB   1
deid                    .75          .97          340s  90MB    1
scrubber                ~.25         ~.99         120s  2.1GB   2

i2b2-2014 Deidentification (combined into 1 file)
                        Specificity  Sensitivity  Time  Memory  CPU Load
Phalanx - Deidentify    .90          .80          25s   750MB   1
deid                    .87          .61          647s  90MB    1
scrubber                                          120s  2.1GB   2

i2b2-2014 Deidentification (514 files)
                        Specificity  Sensitivity  Time  Memory  CPU Load
Phalanx - Deidentify    -            -            25s   250MB   1
deid                    -            -            845s  90MB    1
scrubber                                          90s   1GB     2
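For readers who want to run their own evaluation rather than the one bundled with MIMIC2 deid, the following is a minimal sketch of a character-level sensitivity and specificity calculation. It is not the method used to produce the figures in this README, which does not say how they were computed. The function names are hypothetical, and the span format (start and end character offsets, end exclusive) is an assumption: the column layout of PII_Output1.csv is not documented here, so loading the spans from that file is left to the reader.

```python
def offsets_to_mask(spans, text_length):
    """Return a per-character list of booleans, True where a span covers it.

    Assumption: each span is a (start, end) pair of character offsets with
    an exclusive end. Offsets outside the text are clipped.
    """
    mask = [False] * text_length
    for start, end in spans:
        for i in range(max(start, 0), min(end, text_length)):
            mask[i] = True
    return mask


def sensitivity_specificity(gold_spans, predicted_spans, text_length):
    """Compute character-level sensitivity and specificity.

    sensitivity = TP / (TP + FN): share of gold PHI characters that were found
    specificity = TN / (TN + FP): share of non-PHI characters left untouched
    """
    gold = offsets_to_mask(gold_spans, text_length)
    pred = offsets_to_mask(predicted_spans, text_length)

    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy 20-character document with made-up spans, for illustration only.
    gold = [(0, 4), (10, 14)]
    predicted = [(0, 4), (10, 12), (16, 18)]
    sens, spec = sensitivity_specificity(gold, predicted, 20)
    print(f"{sens:.2f} {spec:.2f}")  # prints "0.75 0.83"
```

Character-level counting is only one choice. Token-level or entity-level scoring would give different numbers, so results from this sketch should not be compared directly with the table above.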