Recent changes to Home

WikiPage Home modified by Zhuoran Wang

Zhuoran Wang — Mon, 08 Aug 2011 12:24:57 -0000

--- v2 
+++ v3 
@@ -1,96 +1,13 @@
-GENERAL DESCRIPTION AND USER GUIDE
-
-1 Outline
 The Learnehr program implements the Semi-Supervised Set Covering Machine (S3CM) algorithm 
 for Electronic Health Record (EHR) freetext classification. This program was developed by 
 Dr Zhuoran Wang initially in 2010, and further revised in 2011, while he was working in 
 the Centre for Statistics and Machine Learning at University College London. This work is 
 part of the Wellcome Trust and NIHR funded project CALIBER (http://www.caliberresearch.org.uk/).
 
 A detailed introduction and empirical evaluations of the algorithm can be found in the 
 following manuscript.
 
   Z. Wang, A. D. Shah, A. R. Tate, S. Denaxas, J. Shawe-Taylor and H. Hemingway. Extracting 
   diagnoses from unstructured text in electronic health records by semi-supervised machine 
   learning. 2011.
 
-2 Compilation
-The source code is written in GNU C++ with the STL. In principle it should work with any C++ compiler that 
-supports the STL, but it has only been tested with g++ (version >= 4.2) on Linux and Mac OS X. To build 
-the program, uncompress everything into a folder, e.g. ‘Learnehr’, and run the following in that directory:
-
-  user@host: Learnehr$ make all
-
-3 Command-line arguments
-To run the program, three command-line arguments are required:
-
-  $ ./Learnehr CaseRecords ControlRecords ReadCodeList
-
-CaseRecords is a GPRD format TSV file containing the EHR records of case patients in a case-control 
-study (i.e. positive and unlabelled examples). 
-
-ControlRecords is a GPRD format TSV file containing the EHR records of control patients in a case-
-control study (i.e. negative examples).
-
-ReadCodeList is a plain text file containing a list of Read codes that identify positive examples, 
-with one Read code per line.
-
-4 Output options
-The program will print the learned rules on the screen, as well as the statistics of the errors if 
-manual annotations for the unlabelled records are available. Here is an example output:
-
-  -------learned rules------- 
-  admit anterior 
-  infarct 
-  vessel diseas 
-  coronari stenosi
-  lad 
-  mid 
-  proxim 
-  rca 
-  occlud 
-  stem 
-  circumflex 
-  lv medic 
-  moder arteri 
-  stent 
-  graft om 
-  margin branch 
-  pda 
-  lm 
-  -----------------------
-  true positive: 96 
-  false positive: 55 
-  false negative: 15
-
-In addition, if specified in the configuration file, it can also output the classification results 
-into an HTML file with predictions and errors highlighted by different colours for visualisation 
-purposes.
-
-5 Input format	
-The GPRD EHR data are stored in tab-separated values (TSV) files. The first row of the file gives 
-the column names. There are 11 columns in the original file format, in the following order: proj_id, 
-type, praceid, staffid, pateid, eventdate, entity_name, readcode, readterm, anon_text, wordcount. 
-If manually annotated labels are available, one can append a column to the respective records, with 
-a label ‘y’ standing for a positive example.
-
-6 Configuration file
-The configuration file config.ini must be placed under the same directory as the Learnehr executable. 
-Its sections and properties are described as follows.
-
-The 'Model' section specifies model parameters for mSCM and S3CM.
-  mu1: penalty coefficient for unlabelled examples in mSCM. 
-  mu2: penalty coefficient for negative examples in mSCM. 
-  Rule: maximum number of rules to learn in an mSCM run. 
-  Iter: number of bootstrapping iterations for S3CM.
-
-The 'Settings' section specifies some learning settings. 
-  IncludeReadTerm: whether to include the Read term as a part of the freetext.
-  RuleOverlap: whether overlapping rules are allowed, 
-               e.g. once ‘lad’ has been learned, whether ‘lad occlud’ is still considered.
-
-The 'Resource' section is optional but highly recommended. 
-  Stopword: path to a file containing a list of stop-words.
-
-The 'Output' section is optional. 
-  Filename: filename to output the visibly marked classification results.

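The error statistics in the example output quoted in the hunk above (true positive 96, false positive 55, false negative 15) can be summarised as precision and recall. This is a minimal sketch in Python; the function name `precision_recall` is hypothetical and is not part of Learnehr, which only prints the raw counts.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw counts; 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Counts taken from the example output in the user guide.
precision, recall = precision_recall(tp=96, fp=55, fn=15)
print(f"precision={precision:.3f} recall={recall:.3f}")
# prints: precision=0.636 recall=0.865
```
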
WikiPage Home modified by Zhuoran Wang

Zhuoran Wang — Mon, 08 Aug 2011 12:23:24 -0000

--- v1 
+++ v2 
@@ -1,5 +1,96 @@
-Welcome to your wiki!
-
-This is the default page, edit it as you see fit. To add a page simply reference it within brackets, e.g.: [SamplePage].
-
-The wiki uses [Markdown](/p/learnehr/home/markdown_syntax/) syntax.
+GENERAL DESCRIPTION AND USER GUIDE
+
+1 Outline
+The Learnehr program implements the Semi-Supervised Set Covering Machine (S3CM) algorithm 
+for Electronic Health Record (EHR) freetext classification. This program was developed by 
+Dr Zhuoran Wang initially in 2010, and further revised in 2011, while he was working in 
+the Centre for Statistics and Machine Learning at University College London. This work is 
+part of the Wellcome Trust and NIHR funded project CALIBER (http://www.caliberresearch.org.uk/).
+
+A detailed introduction and empirical evaluations of the algorithm can be found in the 
+following manuscript.
+
+  Z. Wang, A. D. Shah, A. R. Tate, S. Denaxas, J. Shawe-Taylor and H. Hemingway. Extracting 
+  diagnoses from unstructured text in electronic health records by semi-supervised machine 
+  learning. 2011.
+
+2 Compilation
+The source code is written in GNU C++ with the STL. In principle it should work with any C++ compiler that 
+supports the STL, but it has only been tested with g++ (version >= 4.2) on Linux and Mac OS X. To build 
+the program, uncompress everything into a folder, e.g. ‘Learnehr’, and run the following in that directory:
+
+  user@host: Learnehr$ make all
+
+3 Command-line arguments
+To run the program, three command-line arguments are required:
+
+  $ ./Learnehr CaseRecords ControlRecords ReadCodeList
+
+CaseRecords is a GPRD format TSV file containing the EHR records of case patients in a case-control 
+study (i.e. positive and unlabelled examples). 
+
+ControlRecords is a GPRD format TSV file containing the EHR records of control patients in a case-
+control study (i.e. negative examples).
+
+ReadCodeList is a plain text file containing a list of Read codes that identify positive examples, 
+with one Read code per line.
+
+4 Output options
+The program will print the learned rules on the screen, as well as the statistics of the errors if 
+manual annotations for the unlabelled records are available. Here is an example output:
+
+  -------learned rules------- 
+  admit anterior 
+  infarct 
+  vessel diseas 
+  coronari stenosi
+  lad 
+  mid 
+  proxim 
+  rca 
+  occlud 
+  stem 
+  circumflex 
+  lv medic 
+  moder arteri 
+  stent 
+  graft om 
+  margin branch 
+  pda 
+  lm 
+  -----------------------
+  true positive: 96 
+  false positive: 55 
+  false negative: 15
+
+In addition, if specified in the configuration file, it can also output the classification results 
+into an HTML file with predictions and errors highlighted by different colours for visualisation 
+purposes.
+
+5 Input format	
+The GPRD EHR data are stored in tab-separated values (TSV) files. The first row of the file gives 
+the column names. There are 11 columns in the original file format, in the following order: proj_id, 
+type, praceid, staffid, pateid, eventdate, entity_name, readcode, readterm, anon_text, wordcount. 
+If manually annotated labels are available, one can append a column to the respective records, with 
+a label ‘y’ standing for a positive example.
+
+6 Configuration file
+The configuration file config.ini must be placed under the same directory as the Learnehr executable. 
+Its sections and properties are described as follows.
+
+The 'Model' section specifies model parameters for mSCM and S3CM.
+  mu1: penalty coefficient for unlabelled examples in mSCM. 
+  mu2: penalty coefficient for negative examples in mSCM. 
+  Rule: maximum number of rules to learn in an mSCM run. 
+  Iter: number of bootstrapping iterations for S3CM.
+
+The 'Settings' section specifies some learning settings. 
+  IncludeReadTerm: whether to include the Read term as a part of the freetext.
+  RuleOverlap: whether overlapping rules are allowed, 
+               e.g. once ‘lad’ has been learned, whether ‘lad occlud’ is still considered.
+
+The 'Resource' section is optional but highly recommended. 
+  Stopword: path to a file containing a list of stop-words.
+
+The 'Output' section is optional. 
+  Filename: filename to output the visibly marked classification results.

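The input format described in the user guide above (a header row, 11 tab-separated columns, and an optional appended label column where ‘y’ marks a positive example) could be read as follows. This is only a sketch in Python, not the Learnehr C++ parser: the function name `read_records`, the `positive` key and the sample rows are assumptions made for illustration, and the sample values are invented rather than real patient data.

```python
import csv
import io

# Column order as listed in the user guide.
COLUMNS = ["proj_id", "type", "praceid", "staffid", "pateid", "eventdate",
           "entity_name", "readcode", "readterm", "anon_text", "wordcount"]


def read_records(handle):
    """Parse GPRD-format TSV rows into dicts with a boolean 'positive' flag."""
    # QUOTE_NONE keeps quotation marks in the freetext as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # the first row holds the column names
    records = []
    for row in reader:
        if len(row) < len(COLUMNS):
            continue  # skip blank or truncated lines
        record = dict(zip(COLUMNS, row))
        # An optional 12th column carries the manual annotation.
        label = row[len(COLUMNS)].strip() if len(row) > len(COLUMNS) else ""
        record["positive"] = label == "y"
        records.append(record)
    return records


# Invented rows for illustration only.
header = "\t".join(COLUMNS)
row_a = "\t".join(["p1", "t", "1", "2", "3", "2010-01-01", "e",
                   "CODE1", "term one", "anterior infarct", "2", "y"])
row_b = "\t".join(["p1", "t", "1", "2", "4", "2010-01-02", "e",
                   "CODE2", "term two", "routine review", "2"])
records = read_records(io.StringIO("\n".join([header, row_a, row_b]) + "\n"))
for record in records:
    print(record["anon_text"], record["positive"])
```

When reading from a real file rather than an in-memory string, open it with `newline=""`, as the Python `csv` documentation recommends.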
WikiPage Home modified by Zhuoran Wang

Zhuoran Wang — Mon, 08 Aug 2011 11:45:58 -0000

Welcome to your wiki! This is the default page, edit it as you see fit. To add a page simply reference it within brackets, e.g.: [SamplePage]. The wiki uses [Markdown](/p/learnehr/home/markdown_syntax/) syntax.