Re: [Mulan-list] Mulan-list Digest, Vol 37, Issue 3
From: Eleftherios Spyromitros-X. <esp...@cs...> - 2014-07-23 13:03:49
I am not sure I understand your problem correctly, but it seems that you just want to unset the class attribute. This can easily be done with the setClassIndex() method of the Instances class: just set it to -1.

HTH,
Lefteris

From: Mariela Da Graca Guerra [mailto:mar...@gm...]
Sent: Wednesday, July 23, 2014 2:58 PM
To: mul...@li...
Subject: Re: [Mulan-list] Mulan-list Digest, Vol 37, Issue 3

Greg, thank you for answering me.
I cannot use the Weka Explorer for this because I want to do it in my Java code, so that the ARFF files are generated automatically. I tried the NominalToBinary Java class, but it converts only the attributes that are not the class attribute, and the class attribute is one of the nominal ones.
I will try creating a new nominal-to-binary class in Java that extends the Weka NominalToBinary class, but if anybody knows a better way to do it, please let me know.

Thanks for your help
Mariela

2014-07-22 18:10 GMT-03:00 <mul...@li...>:

Send Mulan-list mailing list submissions to
mul...@li...

To subscribe or unsubscribe via the World Wide Web, visit
https://lists.sourceforge.net/lists/listinfo/mulan-list
or, via email, send a message with subject or body 'help' to
mul...@li...

You can reach the person managing the list at
mul...@li...

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Mulan-list digest..."

Today's Topics:

1. Re: Mulan-list Digest, Vol 37, Issue 1 (Grigorios Tsoumakas)

----------------------------------------------------------------------

Message: 1
Date: Wed, 23 Jul 2014 00:10:28 +0300
From: Grigorios Tsoumakas <gr...@cs...>
Subject: Re: [Mulan-list] Mulan-list Digest, Vol 37, Issue 1
To: mul...@li...
Message-ID: <53C...@cs...>
Content-Type: text/plain; charset="iso-8859-1"

Hi Mariela,

From Weka Explorer:
1) set the class attribute to "No class".
2) use the "unsupervised" version of the NominalToBinary filter I told you.

Now, this should create four new "numeric" attributes:
animals,politics,sports,technology

What remains is to open the arff file in a text editor and change the lines of these attributes as follows:

@attribute class=animals numeric ----> @attribute class=animals {0,1}
etc.

Again, you should take notice of the caveat I mentioned in the previous e-mail: copies of a document in different folders will be considered different documents. So, ultimately, this can be achieved properly only with new code.

Hope this helps,
Greg

On 22/07/2014 09:06 ??, Mariela Da Graca Guerra wrote:
> Hello again,
> I modified my program to use the NominalToBinary filter, but the class
> attribute is not changed. The method setOutputFormat makes the
> transformation only if the attribute is not the class attribute.
> The only nominal attribute in my arff file is @@class@@.
> Is there any way to transform it to binary? I want to create the arff
> in Mulan format to use multi-label classification.
> Thanks in advance
> Cheers
> Mariela
>
> 2014-07-19 5:56 GMT-03:00 <mul...@li...>:
>
> Today's Topics:
>
> 1. Arff Files (Mariela Da Graca Guerra)
> 2. Re: Arff Files (Grigorios Tsoumakas)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 18 Jul 2014 12:18:55 -0300
> From: Mariela Da Graca Guerra <mar...@gm...>
> Subject: [Mulan-list] Arff Files
> To: mul...@li...
> Message-ID: <CALKp2T2Q=fLyLVfdCYhVFa8wOOfwk9JysBw=zz=-F8ZDS2e=pQ...@ma...>
> Content-Type: text/plain; charset="utf-8"
>
> Hi,
> I'm new to Mulan and Weka and I'm trying to do multi-label
> classification of txt files in Java. To do so, I'm using
> TextDirectoryLoader and StringToWordVector from Weka, but the generated
> ARFF doesn't have the format required by Mulan.
>
> Example of my code:
>
> import java.io.File;
> import weka.core.Instances;
> import weka.core.converters.TextDirectoryLoader;
> import weka.filters.Filter;
> import weka.filters.unsupervised.attribute.StringToWordVector;
>
> // Load every txt file; each subfolder name becomes a value of the class attribute
> TextDirectoryLoader loader = new TextDirectoryLoader();
> String classesFolder = "MyPathToTxtFolder"; // This folder has one subfolder per class with the txt files
> loader.setDirectory(new File(classesFolder));
> Instances dataRaw = loader.getDataSet();
>
> // Turn the raw text into one numeric attribute per word
> StringToWordVector filter = new StringToWordVector();
> filter.setWordsToKeep(1000000);
> filter.setUseStoplist(true);
> filter.setInputFormat(dataRaw);
> Instances dataFiltered = Filter.useFilter(dataRaw, filter);
>
> After this code runs, the arff file is printed and has the following
> format in my text file:
>
> @relation 'MyFilename'
>
> @attribute @@class@@ {animals,politics,sports,technology}
> @attribute #WorldCuptrophy numeric
> @attribute $200m numeric
> @attribute $3 numeric
> @attribute Air numeric
> @attribute Airborne numeric
> @attribute Answer numeric
> @attribute Anthrax numeric
> @attribute Apple numeric
> @attribute Apple-IBM numeric
> ...
>
> @data
>
> {6 1,7 1,52 1,64 1,77 1,78 1,229 1,231 1,297 1,391 1,458 1,498 1,708 1,731 1,756 1,762 1,800 1,801 1,813 1,828 1,833 1,839 1,856 1,963 1,970 1,973 1,974 1,1019 1,1041 1,1052 1,1057 1,1059 1,1089 1,1145 1,1152 1,1189 1,1292 1,1331 1,1368 1,1451 1,1560 1,1563 1,1654 1,1792 1,1801 1,1865 1,1866 1,1877 1}
>
> {13 1,14 1,25 1,29 1,33 1,38 1,76 1,101 1,105 1,106 1,259 1,294 1,323 1,329 1,334 1,366 1,414 1,431 1,466 1,473 1,488 1,493 1,499 1,509 1,519 1,520 1,533 1,628 1,633 1,644 1,646 1,654 1,656 1,661 1,687 1,720 1,748 1,829 1,872 1,894 1,911 1,918 1,925 1,939 1,950 1,956 1,957 1,1002 1,1029 1,1035 1,1038 1,1064 1,1077 1,1078 1,1079 1,1098 1,1117 1,1122 1,1123 1,1134 1,1148 1,1175 1,1236 1,1237 1,1250 1,1285 1,1295 1,1296 1,1298 1,1299 1,1300 1,1301 1,1302 1,1314 1,1351 1,1352 1,1361 1,1373 1,1380 1,1389 1,1404 1,1407 1,1408 1,1432 1,1456 1,1469 1,1489 1,1491 1,1493 1,1501 1,1523 1,1540 1,1546 1,1564 1,1569 1,1570 1,1571 1,1573 1,1595 1,1596 1,1610 1,1614 1,1615 1,1627 1,1634 1,1650 1,1691 1,1692 1,1698 1,1699 1,1706 1,1741 1,1742 1,1743 1,1758 1,1791 1,1794 1,1801 1,1802 1,1807 1,1841 1,1848 1,1850 1,1862 1}
>
> {0 politics,4 1,13 1,16 1,32 1,33 1,64 1,73 1,79 1,89 1,100 1,114 1,136 1,147 1,148 1,150 1,151 1,157 1,160 1,163 1,165 1,176 1,184 1,185 1,189 1,192 1,199 1,206 1,216 1,217 1,227 1,242 1,255 1,258 1,259 1,261 1,275 1,276 1,279 1,293 1,298 1,299 1,301 1,302 1,315 1,327 1,337 1,349 1,351 1,363 1,367 1,371 1,383 1,396 1,416 1,422 1,428 1,440 1,445 1,479 1,506 1,536 1,542 1,564 1,565 1,578 1,600 1,628 1,636 1,668 1,673 1,740 1,745 1,752 1,799 1,804 1,805 1,831 1,843 1,860 1,863 1,864 1,871 1,873 1,878 1,889 1,902 1,923 1,935 1,950 1,983 1,1008 1,1012 1,1022 1,1067 1,1071 1,1074 1,1090 1,1096 1,1118 1,1141 1,1142 1,1143 1,1147 1,1154 1,1170 1,1181 1,1191 1,1195 1,1231 1,1232 1,1235 1,1245 1,1256 1,1350 1,1351 1,1369 1,1384 1,1391 1,1397 1,1398 1,1417 1,1423 1,1428 1,1495 1,1496 1,1514 1,1521 1,1542 1,1557 1,1568 1,1576 1,1617 1,1670 1,1682 1,1693 1,1750 1,1752 1,1776 1,1801 1,1805 1,1812 1,1814 1,1871 1,1873 1}
>
> {0 technology,3 1,19 1,20 1,23 1,24 1,27 1,30 1,31 1,43 1,58 1,61 1,62 1,67 1,68 1,71 1,75 1,80 1,81 1,82 1,103 1,104 1,121 1,153 1,156 1,165 1,210 1,211 1,214 1,220 1,257 1,264 1,282 1,296 1,335 1,342 1,357 1,362 1,379 1,387 1,388 1,406 1,415 1,423 1,424 1,427 1,444 1,461 1,469 1,471 1,481 1,482 1,483 1,490 1,531 1,539 1,548 1,553 1,557 1,577 1,590 1,591 1,597 1,601 1,605 1,611 1,613 1,653 1,662 1,666 1,680 1,690 1,692 1,693 1,716 1,721 1,728 1,732 1,750 1,754 1,761 1,768 1,770 1,780 1,781 1,784 1,789 1,797 1,802 1,812 1,819 1,836 1,840 1,852 1,862 1,863 1,865 1,872 1,874 1,877 1,880 1,893 1,908 1,910 1,926 1,927 1,930 1,932 1,934 1,937 1,944 1,1024 1,1033 1,1035 1,1075 1,1080 1,1090 1,1103 1,1112 1,1129 1,1136 1,1151 1,1152 1,1201 1,1214 1,1226 1,1239 1,1243 1,1244 1,1246 1,1251 1,1258 1,1266 1,1276 1,1277 1,1315 1,1347 1,1366 1,1367 1,1371 1,1378 1,1438 1,1439 1,1442 1,1450 1,1455 1,1467 1,1468 1,1473 1,1481 1,1497 1,1534 1,1549 1,1550 1,1561 1,1567 1,1583 1,1602 1,1618 1,1622 1,1625 1,1632 1,1643 1,1646 1,1656 1,1657 1,1664 1,1667 1,1712 1,1728 1,1730 1,1733 1,1763 1,1764 1,1769 1,1784 1,1789 1,1796 1,1801 1,1810 1,1816 1,1834 1,1835 1,1838 1,1845 1,1867 1,1872 1,1876 1,1878 1}
>
> {0 sports,44 1,57 1,149 1,164 1,243 1,261 1,281 1,293 1,310 1,320 1,341 1,399 1,402 1,437 1,444 1,509 1,539 1,544 1,573 1,608 1,626 1,683 1,701 1,713 1,734 1,743 1,749 1,758 1,759 1,810 1,848 1,888 1,909 1,924 1,980 1,1001 1,1064 1,1140 1,1155 1,1188 1,1195 1,1223 1,1234 1,1242 1,1247 1,1303 1,1316 1,1359 1,1360 1,1373 1,1375 1,1393 1,1403 1,1483 1,1551 1,1566 1,1607 1,1640 1,1663 1,1677 1,1679 1,1689 1,1708 1,1739 1,1746 1,1754 1,1756 1,1757 1,1772 1,1782 1,1801 1,1806 1,1820 1,1846 1,1861 1}
>
> I know that the format of the ARFF files for Mulan should be
> different from it (http://mulan.sourceforge.net/format.html), so the
> classifier is not working.
>
> Is there some Java class to create the ARFF in the correct way?
> Is there a way to work only with alphabetic characters? Right now the
> numbers are taken into consideration as attributes. Using the stop words
> doesn't remove them. Maybe I have to do a preprocessing step with
> another class.
>
> Thank you for your help
> Cheers
> Mariela
>
> ------------------------------
>
> Message: 2
> Date: Sat, 19 Jul 2014 11:41:57 +0300
> From: Grigorios Tsoumakas <gr...@cs...>
> Subject: Re: [Mulan-list] Arff Files
> To: mul...@li...
> Cc: Stefanos Laskaridis <las...@cs...>
> Message-ID: <53C...@cs...>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi Mariela,
>
> The next step is to use the NominalToBinary weka filter so as to
> convert the "@@class@@" attribute to as many binary attributes as the
> class values.
>
> However, there is another issue. Even if you put the same document
> into two different folders (if it is multi-labeled), Weka will not
> understand this and will create two instances in the arff for the same
> document.
>
> As copying the same document multiple times into different folders is
> inefficient anyway, what is needed is new code that reads the classes off
> the documents themselves, where the labels are typically found, and
> correctly constructs the multi-label arff files. This is an
> interesting feature request for Mulan for the future.
>
> Best regards,
> Grigorios Tsoumakas
>
> ------------------------------
>
> _______________________________________________
> Mulan-list mailing list
> Mul...@li...
> https://lists.sourceforge.net/lists/listinfo/mulan-list
>
> End of Mulan-list Digest, Vol 37, Issue 1
> *****************************************

End of Mulan-list Digest, Vol 37, Issue 3
*****************************************
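
The header edit Greg describes above (changing "@attribute class=animals numeric" to "@attribute class=animals {0,1}" by hand) can probably be scripted as well, which may suit the goal of generating the files from Java. What follows is only a minimal sketch using plain Java and no Weka or Mulan calls. The class name ArffLabelHeaderFixer, the method fixHeader and the labelPrefix parameter are hypothetical names made up for this illustration, and the sketch assumes the label attributes are unquoted and share a common prefix such as "class=", as in Greg's example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Weka or Mulan. */
final class ArffLabelHeaderFixer {

    /**
     * Rewrites header lines of the form "@attribute <labelPrefix>... numeric"
     * as "@attribute <labelPrefix>... {0,1}". All other lines, and everything
     * from the @data line onwards, are copied through unchanged.
     */
    static List<String> fixHeader(List<String> lines, String labelPrefix) {
        List<String> out = new ArrayList<>();
        boolean inData = false;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.equalsIgnoreCase("@data")) {
                inData = true; // the header ends here
            }
            if (!inData && isNumericLabel(trimmed, labelPrefix)) {
                // Drop the trailing "numeric" and append the binary nominal spec
                int cut = trimmed.length() - "numeric".length();
                out.add(trimmed.substring(0, cut).trim() + " {0,1}");
            } else {
                out.add(line);
            }
        }
        return out;
    }

    /** True for "@attribute <labelPrefix>... numeric" declarations only. */
    private static boolean isNumericLabel(String trimmed, String labelPrefix) {
        String lower = trimmed.toLowerCase(Locale.ROOT);
        if (!lower.startsWith("@attribute") || !lower.endsWith(" numeric")) {
            return false;
        }
        String rest = trimmed.substring("@attribute".length()).trim();
        return rest.startsWith(labelPrefix);
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList(
                "@relation 'MyFilename'",
                "@attribute class=animals numeric",
                "@attribute Apple numeric",
                "@data");
        // Only the attribute starting with "class=" is rewritten
        for (String line : fixHeader(header, "class=")) {
            System.out.println(line);
        }
    }
}
```

The prefix is passed in rather than hard-coded because the attribute names produced by the filter depend on the name of the original class attribute. Word attributes such as "Apple" stay numeric. Note that this does not address Greg's caveat about copies of the same document in different folders being treated as different instances.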