Re: [Senseclusters-users] Unsupervised Relation labeling
From: Stefano S. <ste...@gm...> - 2014-12-09 16:32:28
Hi Ted,

sorry for the delay, but this time we were on holiday in Italy too. I'll reply right away to describe the differences between your example and the files in our experiment (1), and then I need one more explanation (2).

(1) As you showed, your rlabel file associates each cluster element with the instance id of its context. My rlabel file is in CLUTO style: the i-th line holds the "label" of the i-th element of the cluster solution, that is, the name of the extracted medical relation found by my system. So my rlabel file is not a sequence of numbers but a sequence of names, like:

leg bone Cardiology E.R. blood transfusion ...

Those words all occur in my contexts (the medical records used in the experiment), but it can happen that the two entities of a cluster member (each clustered relation is made of two "entities", each a single word or a multiword term extracted from the UMLS databases) come from two different contexts (medical records). So I can't create a file like yours. To work around this, I've created one big context by merging all the medical records into a single context (on one single line). Can this trick work? And, in that case, will the cluster labeling still be functional?

(2) My clustering works on a huge dataset, and we use a really large number of features in the n-dimensional space. I tried to perform a sort of cluster labeling using the -showfeatures option in CLUTO and, in this case, I could see the discriminative and descriptive features of each cluster. My question is: does your cluster labeling use the same algorithm as CLUTO's -showfeatures? Could you summarize the steps of the algorithm used?

Thank you for your help, have a pleasant day,
Stefano

2014-12-02 3:06 GMT+01:00 Ted Pedersen <dul...@gm...>:
> Hi Stefano,
>
> Thanks for your patience. I decided to construct my own little example
> here, and maybe through that we could see what might be different in
> your case.
>
> Here is my input file :
>
> hi ted i am here
> what is your name
>
> And here is this file after running text2sval.pl
>
> <corpus lang="english">
> <lexelt item="LEXELT">
> <instance id="0">
> <answer instance="0" senseid="NOTAG"/>
> <context>
> hi ted i am here
> </context>
> </instance>
> <instance id="1">
> <answer instance="1" senseid="NOTAG"/>
> <context>
> what is your name
> </context>
> </instance>
> </lexelt>
> </corpus>
>
> Now, I created an rlabel file :
>
> 0
> 1
>
> And a cluster_solutions file :
>
> 1
> 1
>
> When I run all of the above with format_clusters.pl, I get something like
> this :
>
> format_clusters.pl testfile.cluster_solution testfile.rlabel
> --context testfile.sval2
> <cluster id="1">
> <instance id="0">
> <context>
> hi ted i am here
> </context>
> </instance>
> <instance id="1">
> <context>
> what is your name
> </context>
> </instance>
> </cluster>
>
> This shows us that instance 0 and 1 are both found in cluster 1, which
> is what I intended.
>
> Now, I am wondering if that points out any differences with what you
> did? I will continue to see if I can re-create your error - if so then
> I'm confident we can figure this out.
>
> More soon, and let me know if you see anything here that seems
> relevant, or if I'm totally off track!
>
> Thanks,
> Ted
>
> On Wed, Nov 26, 2014 at 3:19 PM, Stefano Silvestri
> <ste...@gm...> wrote:
> > Hi Ted,
> > as described in the previous email, I've launched my experiment. As said,
> > the final step of my pipeline is the cluster labeling, using SenseClusters.
> > I want to remind you that the system performs an unsupervised relation
> > extraction from the entities found in 988 clinical records (the entities
> > have been extracted through UMLS databases and we cluster the pairs of
> > entities).
> >
> > To integrate SenseClusters cluster_label into our system, I've produced a
> > CLUTO-style output for the clustering results (around 160000 elements) and
> > an rlabel file (same number), with the list of all the clustered elements.
> > At this point, I have problems running format_clusters.pl.
> >
> > To perform the labeling, I need the format_clusters.pl output, generated
> > with the --context option. So I've created a Senseval-2 file with
> > text2sval.pl. The input file of text2sval.pl is plain text with one whole
> > clinical record on each line.
> > Naturally, each context contains more than one cluster member.
> > I haven't used any optional arguments in text2sval.pl.
> >
> > This output has 988 instance ids. Now, when I try to launch format_clusters.pl,
> > I get the following error, occurring while parsing the Senseval-2 file:
> > Use of uninitialized value $sentence in pattern match (m//) at
> > ../.cpan/build/Text-SenseClusters-1.03-FMoSjn/Toolkit/evaluate/format_clusters.pl
> > line 309, <SCON> line 5938. (when it reaches the last line of the Senseval-2
> > file).
> >
> > I'm thinking that the contexts used are wrong... so my questions are:
> > 1) Do I have to put in the context only the extracted entities or the
> > relations?
> > 2) Must the number of contexts equal the number of clustered elements?
> > 3) If nothing is (theoretically) wrong, what could the error in the
> > Senseval-2 file be?
> >
> > I'm waiting for your response...
> > Thank you for your attention, and I hope that you can help us to complete our
> > research.
> >
> >
> > 2014-10-23 16:02 GMT+02:00 Stefano Silvestri <ste...@gm...>:
> >>
> >> Hi Ted and thanks.
> >>
> >> The PoS tagging, entity recognition, feature extraction and the clustering
> >> tasks have been created with our system (not SenseClusters) - still in
> >> development.
> >> Now I'm trying to use the cluster_labeling module of SenseClusters to show
> >> that we have found, with an unsupervised approach, the relations between medical
> >> entities in the clinical records (e.g. diabetes mellitus <> glycemia) and
> >> to have, in this way, some labels for the clusters.
> >>
> >> I'm now writing the code to create the context files and then I'll run the
> >> experiments on cluster labeling. I'll let you know in a few days if
> >> everything worked well and, in case of a new publication, I'll cite your
> >> great work.
> >>
> >> I'm sure that I will ask some more things in the next days, so I thank you
> >> in advance.
> >> Stefano Silvestri
> >>
> >>
> >> 2014-10-23 15:07 GMT+02:00 Ted Pedersen <dul...@gm...>:
> >>>
> >>> Hi Stefano,
> >>>
> >>> This sounds like an interesting project, and it's good to know
> >>> SenseClusters is proving to be useful. See my responses inline...
> >>>
> >>> On Wed, Oct 22, 2014 at 5:58 AM, Stefano Silvestri
> >>> <ste...@gm...> wrote:
> >>> > I've used a clustering technique to discover, in an unsupervised way,
> >>> > relations between medical entities contained in a large collection of
> >>> > anonymized medical records, in a research project of the University of
> >>> > Naples.
> >>> > The data set is composed of a large set of features - all the results
> >>> > will shortly be published in a journal.
> >>> >
> >>> > The next step in the development of our system is performing an
> >>> > unsupervised cluster (relation) labeling. To do that, I plan to try the
> >>> > clusterlabeling module from SenseClusters. To create the input to
> >>> > clusterlabeling I have to use the format_clusters module with the
> >>> > --context option, and now I have some problems.
> >>> >
> >>> > I have already produced a CLUTO-style cluster solution file (no problem
> >>> > with that) from my system.
> >>> >
> >>> > The rlabel file, if I'm right, is a file containing the explicit
> >>> > corresponding name of each entity in the cluster (in my case the
> >>> > relation).
> >>> > Is that right?
> >>>
> >>> Yes, rlabel shows the cluster to which each instance has been assigned.
> >>>
> >>> >
> >>> > And now the problems about the context file...
> >>> > It should be in Senseval-2 format. My experimental assessment is made up of
> >>> > plain text files - so I should use the plain text to headless Senseval-2
> >>> > utility.
> >>> >
> >>> > I have some questions.
> >>> >
> >>> > 1) Does the context file have to put together all my input files (the
> >>> > medical records) in one large file (with each context corresponding to
> >>> > a medical record)?
> >>>
> >>> Yes, the input for each run of SenseClusters should be a single file
> >>> with all your contexts included.
> >>>
> >>> >
> >>> > 2) Can the contexts be headless, or do I have to tag (<head></head>) all
> >>> > the entities (medical names) in the input?
> >>>
> >>> Your contexts can be headless, and so there is no need to include
> >>> <head> tags in your contexts.
> >>>
> >>> >
> >>> > 3) Are there other constraints on the context files (formatting, tags, or
> >>> > other)?
> >>> >
> >>>
> >>> There shouldn't be. The output from text2sval.pl should be acceptable
> >>> for input "as is".
> >>>
> >>> > In case of success of the experiments, of course, I'll credit and cite
> >>> > the SenseClusters project.
> >>> >
> >>> > PS - my system works on the Italian language.
> >>>
> >>> That's great! We'd be happy to answer further questions as they arise,
> >>> and will be curious to know how things work out!
> >>>
> >>> Good luck,
> >>> Ted
> >>>
> >>> >
> >>> > Thanks for your response,
> >>> > Stefano Silvestri,
> >>> > NLP researcher at University of Naples "Federico II"
> >>> >
> >>> > _______________________________________________
> >>> > senseclusters-users mailing list
> >>> > sen...@li...
> >>> > https://lists.sourceforge.net/lists/listinfo/senseclusters-users
> >>> >
> >>>
> >>> --
> >>> Ted Pedersen
> >>> http://www.d.umn.edu/~tpederse
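
The mismatch discussed in this thread (an rlabel file holding relation names versus one holding instance ids) can be checked before running format_clusters.pl. The following is a minimal Python sketch, not part of SenseClusters. It assumes, based only on Ted's two-line example, that line i of the rlabel file is expected to name an instance id present in the Senseval-2 file, and that line i of the cluster solution file holds that instance's cluster id. The function names `instance_ids`, `check_alignment`, and `group_by_cluster` are hypothetical helpers introduced here for illustration.

```python
import re
from collections import defaultdict

# Matches <instance id="..."> but not <answer instance="..."/>.
INSTANCE_ID = re.compile(r'<instance\s+id="([^"]*)"')


def instance_ids(sval2_text):
    """Return the instance ids found in Senseval-2 text, in file order."""
    return INSTANCE_ID.findall(sval2_text)


def check_alignment(solution, rlabels, sval2_text):
    """Return a list of problems; an empty list means none were found.

    Assumption (taken from the example in this thread, not from the
    SenseClusters documentation): line i of rlabel names the instance
    whose cluster id is on line i of the cluster solution.
    """
    problems = []
    if len(solution) != len(rlabels):
        problems.append(
            f"{len(solution)} cluster ids but {len(rlabels)} rlabel lines"
        )
    known = set(instance_ids(sval2_text))
    unknown = [label for label in rlabels if label not in known]
    if unknown:
        problems.append(
            f"{len(unknown)} rlabel entries are not instance ids, "
            f"e.g. {unknown[0]!r}"
        )
    return problems


def group_by_cluster(solution, rlabels):
    """Pair line i of the solution with line i of rlabel, CLUTO style."""
    clusters = defaultdict(list)
    for cluster_id, label in zip(solution, rlabels):
        clusters[cluster_id].append(label)
    return dict(clusters)


if __name__ == "__main__":
    # The two-instance Senseval-2 file from the example above.
    sval2 = """<corpus lang="english">
<lexelt item="LEXELT">
<instance id="0">
<answer instance="0" senseid="NOTAG"/>
<context>
hi ted i am here
</context>
</instance>
<instance id="1">
<answer instance="1" senseid="NOTAG"/>
<context>
what is your name
</context>
</instance>
</lexelt>
</corpus>
"""
    # rlabel with instance ids, as in Ted's example.
    print(check_alignment(["1", "1"], ["0", "1"], sval2))
    # rlabel with made-up relation names instead of instance ids.
    print(check_alignment(["1", "1"], ["relation_a", "relation_b"], sval2))
    print(group_by_cluster(["1", "1"], ["0", "1"]))
```

If the check reports rlabel entries that are not instance ids, one option consistent with Ted's example would be a Senseval-2 file with one instance per clustered element; whether that suits the labeling step is a question for the SenseClusters authors.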