Thread: [R-gregmisc-users] SF.net SVN: r-gregmisc: [1033] trunk/gdata/inst/doc
From: <gg...@us...> - 2006-11-30 21:32:18
Revision: 1033
          http://svn.sourceforge.net/r-gregmisc/?rev=1033&view=rev
Author:   ggorjan
Date:     2006-11-30 13:32:08 -0800 (Thu, 30 Nov 2006)

Log Message:
-----------
description of unknown methods

Added Paths:
-----------
    trunk/gdata/inst/doc/unknown.pdf
    trunk/gdata/inst/doc/unknown.tex

Added: trunk/gdata/inst/doc/unknown.pdf
===================================================================
(Binary files differ)

Property changes on: trunk/gdata/inst/doc/unknown.pdf
___________________________________________________________________
Name: svn:mime-type
   + application/octet-stream

Added: trunk/gdata/inst/doc/unknown.tex
===================================================================
--- trunk/gdata/inst/doc/unknown.tex (rev 0)
+++ trunk/gdata/inst/doc/unknown.tex 2006-11-30 21:32:08 UTC (rev 1033)
@@ -0,0 +1,301 @@
+\documentclass[a4paper]{report}
+\usepackage{Rnews}
+\usepackage[round]{natbib}
+\bibliographystyle{abbrvnat}
+
+\begin{document}
+
+\begin{article}
+
+\title{Working with unknown values}
+\subtitle{The gdata package}
+\author{by Gregor Gorjanc}
+
+\maketitle
+
+Published as \cite{Gorjanc}.
+
+\section{Introduction}
+
+Unknown or missing values can be represented in various ways. For example
+SAS uses \code{.} (dot), while \R{} uses \code{NA}, which we can read as
+Not Available. When we import data into \R{}, say via \code{read.table} or
+its derivatives, conversion of blank fields to \code{NA} (according to
+\code{read.table} help) is done for \code{logical}, \code{integer},
+\code{numeric} and \code{complex} classes. Additionally, \code{na.strings}
+argument can be used to specify values that should also be converted to
+\code{NA}. Inversely there is an argument \code{na} in \code{write.table}
+and its derivatives to define value that will replace \code{NA} in exported
+data. There are also other ways to import/export data into \R{} as
+described in ``R Data Import/Export'' Manual \citep{RImportExportManual},
+however all approaches lack the possibility to define unknown value(s) for
+particular column. It is possible that unknown value in one column is a
+valid value in another column. For example I have seen many datasets where
+values as 0, -9, 999 and specific dates are used as column specific unknown
+values in one dataset.
+
+This note represents set of functions in \pkg{gdata}\footnote{from version
+  2.3.1} package \citep{WarnesGdata}: \code{isUnknown}, \code{unknownToNA}
+and \code{NAToUnknown}, which can help with testing for unknown values and
+conversions between unknown values and \code{NA}. All three functions are
+generic (S3) and were tested (at the time of writing) to work with:
+\code{integer}, \code{numeric}, \code{character}, \code{factor},
+\code{Date}, \code{POSIXct}, \code{POSIXlt}, \code{list}, \code{data.frame}
+and \code{matrix} classes.
+
+\section{Description with examples}
+
+The following examples show simple usage of this functions on
+\code{numeric} and \code{factor} classes, where value \code{0} (beside
+\code{NA}) should be treated as unknown value:
+
+\begin{smallverbatim}
+> library(gdata)
+> xNum <- c(0, 6, 0, 7, 8, 9, NA)
+> isUnknown(x=xNum)
+[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
+\end{smallverbatim}
+
+The default unknown value in \code{isUnknown} is \code{NA}, which means
+that output is the same to \code{is.na} - at least for atomic
+classes. However, we can pass argument \code{unknown} to define which
+values should be treated as unknown.
+
+\begin{smallverbatim}
+> isUnknown(x=xNum, unknown=0)
+[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
+\end{smallverbatim}
+
+This skipped \code{NA}, but we can get expected answer after appropriately
+adding \code{NA} into argument \code{unknown}.
+
+\begin{smallverbatim}
+> isUnknown(x=xNum, unknown=c(0, NA))
+[1] TRUE FALSE TRUE FALSE FALSE FALSE TRUE
+\end{smallverbatim}
+
+Now, we can change all unknown values to \code{NA} with \code{unknownToNA}.
+There is clearly no need to add \code{NA} here. This step is very handy
+after importing data from external source, where many different unknown
+values might be used. Argument \code{warning=TRUE} can be used, if there is
+a need to be warned about ``original'' \code{NA}s.
+
+\begin{smallverbatim}
+> xNum2 <- unknownToNA(x=xNum, unknown=0)
+[1] NA 6 NA 7 8 9 NA
+\end{smallverbatim}
+
+Prior to export from \R{}, we might want to change unknown value (\code{NA}
+in \R{}) to some other value. Function \code{NAToUnknown} can be used for
+this.
+
+\begin{smallverbatim}
+> NAToUnknown(x=xNum2, unknown=999)
+[1] 999 6 999 7 8 9 999
+\end{smallverbatim}
+
+Converting \code{NA} to value that already exists in \code{x} issues an
+error, however \code{force=TRUE} can be used to overcome this if
+needed. But be warned that there is no way back from this step.
+
+\begin{smallverbatim}
+> NAToUnknown(x=xNum2, unknown=7, force=TRUE)
+[1] 7 6 7 7 8 9 7
+\end{smallverbatim}
+
+Examples bellow show all peculiarities with \code{factor} class.
+\code{unknownToNA} removes \code{unknown} value from levels and inversely
+\code{NAToUnknown} adds it with a warning. Additionally, \code{"NA"} is
+properly distinguished from \code{NA}. It can also be seen that argument
+\code{unknown} in functions \code{isUnknown} and \code{unknownToNA} need
+not match class of \code{x} (otherwise factor should be used) as the test
+is internally done with \code{\%in\%}, which nicely resolves coercing
+issues.
+
+\begin{smallverbatim}
+> xFac <- factor(c(0, "BA", "RA", "BA", NA, "NA"))
+[1] 0 BA RA BA <NA> NA
+Levels: 0 BA NA RA
+> isUnknown(x=xFac)
+[1] FALSE FALSE FALSE FALSE TRUE FALSE
+> isUnknown(x=xFac, unknown=0)
+[1] TRUE FALSE FALSE FALSE FALSE FALSE
+> isUnknown(x=xFac, unknown=c(0, NA))
+[1] TRUE FALSE FALSE FALSE TRUE FALSE
+> isUnknown(x=xFac, unknown=c(0, "NA"))
+[1] TRUE FALSE FALSE FALSE FALSE TRUE
+> isUnknown(x=xFac, unknown=c(0, "NA", NA))
+[1] TRUE FALSE FALSE FALSE TRUE TRUE
+
+> xFac <- unknownToNA(x=xFac, unknown=0)
+[1] <NA> BA RA BA <NA> NA
+Levels: BA NA RA
+> xFac <- NAToUnknown(x=xFac, unknown=0)
+[1] 0 BA RA BA 0 NA
+Levels: 0 BA NA RA
+Warning message:
+new level is introduced: 0
+\end{smallverbatim}
+
+These two examples with numeric an factor classes are fairly simple and we
+could get the same results with one or two lines of \R{} code. The real
+benefit of presented set of functions is in \code{list} and
+\code{data.frame} methods, where \code{data.frame} methods are merely
+wrappers for \code{list} methods.
+
+We need additional flexibility for \code{list}/\code{data.frame} methods,
+due to possibility of having multiple unknown values that can be different
+among \code{list} components or \code{data.frame} columns. For these two
+methods, argument \code{unknown} can be either a \code{vector} or
+\code{list}, both possibly named. Of course, greater flexibility (defining
+multiple unknown values per component/column) can be achieved with
+a \code{list}.
+
+When \code{vector}/\code{list} passed to argument \code{unknown} is not
+named, first value/component of a \code{vector}/\code{list} matches first
+component/column of a \code{list}/\code{data.frame}. This can be quite
+error prone, especially with \code{vectors}. Therefore, I encourage use of
+a \code{list}. In case \code{vector}/\code{list} passed to argument
+\code{unknown} is named, names are matched to names of \code{list} or
+\code{data.frame}. If lengths of \code{unknown} and \code{list} or
+\code{data.frame} do not match, recycling occurs.
+
+Example bellow shows usage of described functions on a list, that is
+composed of previously defined and modified numeric (\code{xNum}) and
+factor (\code{xFac}) classes. First function \code{isUnknown} is used with
+\code{0} as unknown value. Note that we get \code{FALSE} for \code{NA}s as
+has been the case in the first example.
+
+\begin{smallverbatim}
+> xList <- list(a=xNum, b=xFac)
+$a
+[1] 0 6 0 7 8 9 NA
+
+$b
+[1] 0 BA RA BA 0 NA
+Levels: 0 BA NA RA
+> isUnknown(x=xList, unknown=0)
+$a
+[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
+
+$b
+[1] TRUE FALSE FALSE FALSE TRUE FALSE
+\end{smallverbatim}
+
+We need to add \code{NA} as unknown value. However, we do not get the
+expected result this way!
+
+\begin{smallverbatim}
+> isUnknown(x=xList, unknown=c(0, NA))
+$a
+[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
+
+$b
+[1] FALSE FALSE FALSE FALSE FALSE FALSE
+\end{smallverbatim}
+
+This is due to matching of values in argument \code{unknown} and components
+in a \code{list} i.e. \code{0} is used for component \code{a} and \code{NA}
+for component \code{b}. Therefore, it is less error prone and more
+flexible to pass \code{list} (preferably named one) to argument
+\code{unknown} as shown bellow.
+
+\begin{smallverbatim}
+> xList1 <- unknownToNA(x=xList,
++ unknown=list(b=c(0, "NA"), a=0))
+$a
+[1] NA 6 NA 7 8 9 NA
+
+$b
+[1] <NA> BA RA BA <NA> <NA>
+Levels: BA RA
+\end{smallverbatim}
+
+Changing \code{NA}s to some other value (only one per component/column) can
+be now something like this:
+
+\begin{smallverbatim}
+> NAToUnknown(x=xList1, unknown=list(b="no", a=0))
+$a
+[1] 0 6 0 7 8 9 0
+
+$b
+[1] no BA RA BA no no
+Levels: BA no RA
+
+Warning message:
+new level is introduced: no
+\end{smallverbatim}
+
+Named component \code{.default} of a \code{list} passed to argument
+\code{unknown} has a special meaning as it will match component/column with
+that name and any other not defined in \code{unknown}. As such it is very
+useful if the number of components/columns with the same unknown value(s)
+is large. Imagine a wide \code{data.frame} named \code{df}. Now
+\code{.default} can be used to define unknown value for several columns:
+
+\begin{smallverbatim}
+> df <- unknownToNA(x=df,
++ unknown=list(.default=0,
++ col1=999,
++ col2="unknown"))
+\end{smallverbatim}
+
+If there is a need to work only on some components/columns you can of
+course ``skip'' columns with standard \R{} mechanisms i.e.
+with subsetting \code{list} or \code{data.frame} objects.
+
+\begin{smallverbatim}
+> cols <- c("col1", "col2")
+> df[, cols] <- unknownToNA(x=df[, cols],
++ unknown=list(col1=999,
++ col2="unknown"))
+\end{smallverbatim}
+
+\section{Summary}
+
+Functions \code{isUnknown}, \code{unknownToNA} and \code{NAToUnknown}
+provide a nice interface to work with various representations of
+unknown/missing values. Their use is meant primarily for shaping the data
+after importing to or before exporting from \R{}. I welcome any comments or
+suggestions.
+
+% \bibliography{refs}
+
+\begin{thebibliography}{1}
+\providecommand{\natexlab}[1]{#1}
+\providecommand{\url}[1]{\texttt{#1}}
+\expandafter\ifx\csname urlstyle\endcsname\relax
+  \providecommand{\doi}[1]{doi: #1}\else
+  \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi
+
+\bibitem[Gorjanc(2007)]{Gorjanc}
+G.~Gorjanc.
+\newblock Working with unknown values: the gdata package.
+\newblock \emph{R News}, 1\penalty0 (1):\penalty0 ?--?, ? 2007.
+\newblock URL \url{http://CRAN.R-project.org/doc/Rnews/Rnews_2007-?.pdf}.
+
+\bibitem[{R Development Core Team}(2006)]{RImportExportManual}
+{R Development Core Team}.
+\newblock \emph{R Data Import/Export}, 2006.
+\newblock URL \url{http://cran.r-project.org/manuals.html}.
+\newblock ISBN 3-900051-10-0.
+
+\bibitem[Warnes.(2006)]{WarnesGdata}
+G.~R. Warnes.
+\newblock \emph{gdata: Various R programming tools for data manipulation},
+  2006.
+\newblock URL
+  \url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}.
+\newblock R package version 2.3.1. Includes R source code and/or documentation
+  contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley.
+
+\end{thebibliography}
+
+\address{Gregor Gorjanc\\
+  University of Ljubljana, Slovenia\\
+\email{gre...@bf...}}
+
+\end{article}
+
+\end{document}

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
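The column-wise matching described in the vignette above (a named list for `unknown`, with `.default` covering every column not named explicitly) can be illustrated without the package. The following is a minimal base-R sketch under stated assumptions: the helper name `unknown_to_na_sketch` and the toy data frame are hypothetical, and this is not the gdata implementation; it only mirrors the `%in%` test the text mentions.

```r
# Hypothetical helper, NOT the gdata implementation: a base-R sketch of
# per-column unknown-to-NA conversion driven by a named list.
unknown_to_na_sketch <- function(df, unknown) {
  for (col in names(df)) {
    # Column-specific codes if given, otherwise the ".default" entry
    # (NULL when the list has no ".default" component).
    codes <- if (col %in% names(unknown)) unknown[[col]] else unknown[[".default"]]
    if (is.null(codes)) next
    hit <- df[[col]] %in% codes   # %in% handles type coercion and never returns NA
    df[[col]][hit] <- NA
    # Drop levels that became unused, as the vignette describes for factors.
    if (is.factor(df[[col]])) df[[col]] <- droplevels(df[[col]])
  }
  df
}

# Toy data (assumed for illustration): 999 and "unknown" are column-specific
# codes, 0 is the code for every other column.
df <- data.frame(col1 = c(1, 999, 3),
                 col2 = c("a", "unknown", "b"),
                 col3 = c(0, 5, 0),
                 stringsAsFactors = FALSE)
res <- unknown_to_na_sketch(df, list(.default = 0, col1 = 999, col2 = "unknown"))
print(res)
```

Because the codes are looked up by column name, the order of the entries in the list does not matter, which is the reason the text recommends a named list over an unnamed vector.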
From: <gg...@us...> - 2006-11-30 21:36:44
Revision: 1034
          http://svn.sourceforge.net/r-gregmisc/?rev=1034&view=rev
Author:   ggorjan
Date:     2006-11-30 13:36:40 -0800 (Thu, 30 Nov 2006)

Log Message:
-----------
description of mapLevels methods

Added Paths:
-----------
    trunk/gdata/inst/doc/mapLevels.pdf
    trunk/gdata/inst/doc/mapLevels.tex

Added: trunk/gdata/inst/doc/mapLevels.pdf
===================================================================
(Binary files differ)

Property changes on: trunk/gdata/inst/doc/mapLevels.pdf
___________________________________________________________________
Name: svn:mime-type
   + application/octet-stream

Added: trunk/gdata/inst/doc/mapLevels.tex
===================================================================
--- trunk/gdata/inst/doc/mapLevels.tex (rev 0)
+++ trunk/gdata/inst/doc/mapLevels.tex 2006-11-30 21:36:40 UTC (rev 1034)
@@ -0,0 +1,255 @@
+\documentclass[a4paper]{report}
+\usepackage{Rnews}
+\usepackage[round]{natbib}
+\bibliographystyle{abbrvnat}
+
+\begin{document}
+
+\begin{article}
+
+\title{Mapping levels of a factor}
+\subtitle{The gdata package}
+\author{by Gregor Gorjanc}
+
+\maketitle
+
+\section{Introduction}
+
+Factors use levels attribute to store information on mapping between
+internal integer codes and character values i.e. levels. First level is
+mapped to internal integer code 1 and so on. Although some users do not
+like factors, their use is more efficient in terms of storage than for
+character vectors. Additionally, there are many functions in base \R{} that
+provide additional value for factors. Sometimes users need to work with
+internal integer codes and mapping them back to factor, especially when
+interfacing external programs. Mapping information is also of interest if
+there are many factors that should have the same set of levels. This note
+describes \code{mapLevels} function, which is a utility function for
+mapping the levels of a factor in \pkg{gdata} \footnote{from version 2.3.1}
+package \citep{WarnesGdata}.
+
+\section{Description with examples}
+
+Function \code{mapLevels()} is an (S3) generic function and works on
+\code{factor} and \code{character} atomic classes. It also works on
+\code{list} and \code{data.frame} objects with previously mentioned atomic
+classes. Function \code{mapLevels} produces a so called ``map'' with names
+and values. Names are levels, while values can be internal integer codes or
+(possibly other) levels. This will be clarified later on. Class of this
+``map'' is \code{levelsMap}, if \code{x} in \code{mapLevels()} was atomic
+or \code{listLevelsMap} otherwise - for \code{list} and \code{data.frame}
+classes. The following example shows the creation and printout of such a
+``map''.
+
+\begin{smallverbatim}
+> library(gdata)
+> (fac <- factor(c("B", "A", "Z", "D")))
+[1] B A Z D
+Levels: A B D Z
+> (map <- mapLevels(x=fac))
+A B D Z
+1 2 3 4
+\end{smallverbatim}
+
+If we have to work with internal integer codes, we can transform factor to
+integer and still get ``back the original factor'' with ``map'' used as
+argument in \code{mapLevels<-} function as shown below. \code{mapLevels<-}
+is also an (S3) generic function and works on same classes as
+\code{mapLevels} plus \code{integer} atomic class.
+
+\begin{smallverbatim}
+> (int <- as.integer(fac))
+[1] 2 1 4 3
+> mapLevels(x=int) <- map
+> int
+[1] B A Z D
+Levels: A B D Z
+> identical(fac, int)
+[1] TRUE
+\end{smallverbatim}
+
+Internally ``map'' (\code{levelsMap} class) is a \code{list} (see below),
+but its print method unlists it for ease of inspection. ``Map'' from
+example has all components of length 1. This is not mandatory as
+\code{mapLevels<-} function is only a wrapper around workhorse function
+\code{levels<-} and the latter can accept \code{list} with components of
+various lengths.
+
+\begin{smallverbatim}
+> str(map)
+List of 4
+ $ A: int 1
+ $ B: int 2
+ $ D: int 3
+ $ Z: int 4
+ - attr(*, "class")= chr "levelsMap"
+\end{smallverbatim}
+
+Although not of primary importance, this ``map'' can also be used to remap
+factor levels as shown below. Components ``later'' in the map take over
+the ``previous'' ones. Since this is not optimal I would rather recommend
+other approaches for ``remapping'' the levels of a \code{factor}, say
+\code{recode} in \pkg{car} package \citep{FoxCar}.
+
+\begin{smallverbatim}
+> map[[2]] <- as.integer(c(1, 2))
+> map
+A B B D Z
+1 1 2 3 4
+> int <- as.integer(fac)
+> mapLevels(x=int) <- map
+> int
+[1] B B Z D
+Levels: A B D Z
+\end{smallverbatim}
+
+Up to now examples showed ``map'' with internal integer codes for values
+and levels for names. I call this integer ``map''. On the other hand
+character ``map'' uses levels for values and (possibly other) levels for
+names. This feature is a bit odd at first sight, but can be used to easily
+unify levels and internal integer codes across several factors. Imagine
+you have a factor that is for some reason split into two factors \code{f1}
+and \code{f2} and that each factor does not have all levels. This is not
+an uncommon situation.
+
+\begin{smallverbatim}
+> (f1 <- factor(c("A", "D", "C")))
+[1] A D C
+Levels: A C D
+> (f2 <- factor(c("B", "D", "C")))
+[1] B D C
+Levels: B C D
+\end{smallverbatim}
+
+If we work with these factors, we need to be careful as they do not have the
+same set of levels. This can be solved with appropriately specifying
+\code{levels} argument in creation of factors i.e. \code{levels=c("A", "B",
+  "C", "D")} or with proper use of \code{levels<-} function. I say proper
+as it is very tempting to use:
+
+\begin{smallverbatim}
+> fTest <- f1
+> levels(fTest) <- c("A", "B", "C", "D")
+> fTest
+[1] A C B
+Levels: A B C D
+\end{smallverbatim}
+
+Above example extends set of levels, but also changes level of 2nd and 3rd
+element in \code{fTest}! Proper use of \code{levels<-} (as shown in
+\code{levels} help page) would be:
+
+\begin{smallverbatim}
+> fTest <- f1
+> levels(fTest) <- list(A="A", B="B", C="C", D="D")
+> fTest
+[1] A D C
+Levels: A B C D
+\end{smallverbatim}
+
+Function \code{mapLevels} with character ``map'' can help us in such
+scenarios to unify levels and internal integer codes across several
+factors. Again the workhorse under this process is \code{levels<-} function
+from base \R{}! Function \code{mapLevels<-} just controls the assignment of
+(integer or character) ``map'' to \code{x}. Levels in \code{x} that match
+``map'' values (internal integer codes or levels) are changed to ``map''
+names (possibly other levels) as shown in \code{levels} help page. Levels
+that do not match are converted to \code{NA}. Integer ``map'' can be
+applied to \code{integer} or \code{factor}, while character ``map'' can be
+applied to \code{character} or \code{factor}. Result of \code{mapLevels<-}
+is always a \code{factor} with possibly ``remapped'' levels.
+
+To get one joint character ``map'' for several factors, we need to
+put factors in a \code{list} or \code{data.frame} and use arguments
+\code{codes=FALSE} and \code{combine=TRUE}. Such map can then be used to
+unify levels and internal integer codes.
+
+\begin{smallverbatim}
+> (bigMap <- mapLevels(x=list(f1, f2), codes=FALSE,
++ combine=TRUE))
+ A B C D
+"A" "B" "C" "D"
+> mapLevels(f1) <- bigMap
+> mapLevels(f2) <- bigMap
+> f1
+[1] A D C
+Levels: A B C D
+> f2
+[1] B D C
+Levels: A B C D
+> cbind(as.character(f1), as.integer(f1),
++ as.character(f2), as.integer(f2))
+ [,1] [,2] [,3] [,4]
+[1,] "A" "1" "B" "2"
+[2,] "D" "4" "D" "4"
+[3,] "C" "3" "C" "3"
+\end{smallverbatim}
+
+If we do not specify \code{combine=TRUE} (which is the default behaviour)
+and \code{x} is a \code{list} or \code{data.frame}, \code{mapLevels}
+returns ``map'' of class \code{listLevelsMap}. This is internally a
+\code{list} of ``maps'' (\code{levelsMap} objects). Both
+\code{listLevelsMap} and \code{levelsMap} objects can be passed to
+\code{mapLevels<-} for \code{list}/\code{data.frame}. Recycling occurs when
+length of \code{listLevelsMap} is not the same as number of
+components/columns of a \code{list}/\code{data.frame}.
+
+Additional convenience methods are also implemented to ease the work with
+``maps'':
+
+\begin{itemize}
+
+\item \code{is.levelsMap}, \code{is.listLevelsMap}, \code{as.levelsMap} and
+  \code{as.listLevelsMap} for testing and coercion of user defined
+  ``maps'',
+
+\item \code{"["} for subsetting,
+
+\item \code{c} for combining \code{levelsMap} or \code{listLevelsMap}
+  objects; argument \code{recursive=TRUE} can be used to coerce
+  \code{listLevelsMap} to \code{levelsMap}, for example \code{c(llm1, llm2,
+  recursive=TRUE)} and
+
+\item \code{unique} and \code{sort} for \code{levelsMap}.
+
+\end{itemize}
+
+\section{Summary}
+
+Functions \code{mapLevels} and \code{mapLevels<-} can help users to map
+internal integer codes to factor levels and unify levels as well as
+internal integer codes among several factors. I welcome any comments or
+suggestions.
+
+% \bibliography{refs}
+\begin{thebibliography}{1}
+\providecommand{\natexlab}[1]{#1}
+\providecommand{\url}[1]{\texttt{#1}}
+\expandafter\ifx\csname urlstyle\endcsname\relax
+  \providecommand{\doi}[1]{doi: #1}\else
+  \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi
+
+\bibitem[Fox(2006)]{FoxCar}
+J.~Fox.
+\newblock \emph{car: Companion to Applied Regression}, 2006.
+\newblock URL \url{http://socserv.socsci.mcmaster.ca/jfox/}.
+\newblock R package version 1.1-1.
+
+\bibitem[Warnes.(2006)]{WarnesGdata}
+G.~R. Warnes.
+\newblock \emph{gdata: Various R programming tools for data manipulation},
+  2006.
+\newblock URL
+  \url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}.
+\newblock R package version 2.3.1. Includes R source code and/or documentation
+  contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley.
+
+\end{thebibliography}
+
+\address{Gregor Gorjanc\\
+  University of Ljubljana, Slovenia\\
+\email{gre...@bf...}}
+
+\end{article}
+
+\end{document}
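The level-unification scenario from the mapLevels vignette above can also be reproduced with base R alone. This is a minimal sketch, assuming only base functions (`union`, `factor`, `levels`); the names `all_levels`, `f1u`, `f2u`, `codes` and `back` are illustrative, and this is not how `mapLevels<-` is implemented.

```r
# The two partial factors from the vignette.
f1 <- factor(c("A", "D", "C"))
f2 <- factor(c("B", "D", "C"))

# Joint level set, sorted so that integer codes are comparable across factors.
all_levels <- sort(union(levels(f1), levels(f2)))

# Re-create each factor against the joint level set. The values are kept;
# only the internal integer codes change.
f1u <- factor(as.character(f1), levels = all_levels)
f2u <- factor(as.character(f2), levels = all_levels)

print(cbind(as.character(f1u), as.integer(f1u),
            as.character(f2u), as.integer(f2u)))

# Round trip through the integer codes, e.g. after passing them to an
# external program: index the levels by code to recover the factor.
codes <- as.integer(f1u)
back <- factor(levels(f1u)[codes], levels = levels(f1u))
```

Going through `as.character()` avoids the pitfall shown in the vignette, where assigning a plain character vector to `levels(fTest)` silently relabels existing elements.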
From: <gg...@us...> - 2007-06-06 10:24:17
Revision: 1097
          http://svn.sourceforge.net/r-gregmisc/?rev=1097&view=rev
Author:   ggorjan
Date:     2007-06-06 03:24:15 -0700 (Wed, 06 Jun 2007)

Log Message:
-----------
last edits from newsletter

Modified Paths:
--------------
    trunk/gdata/inst/doc/unknown.pdf
    trunk/gdata/inst/doc/unknown.tex

Modified: trunk/gdata/inst/doc/unknown.pdf
===================================================================
(Binary files differ)

Modified: trunk/gdata/inst/doc/unknown.tex
===================================================================
--- trunk/gdata/inst/doc/unknown.tex 2007-06-06 10:19:15 UTC (rev 1096)
+++ trunk/gdata/inst/doc/unknown.tex 2007-06-06 10:24:15 UTC (rev 1097)
@@ -7,108 +7,109 @@

 \begin{article}

-\title{Working with unknown values}
-\subtitle{The gdata package}
+\title{Working with Unknown Values}
+\subtitle{The \pkg{gdata} package}
 \author{by Gregor Gorjanc}

 \maketitle

-Published as \cite{Gorjanc}.
+This vignette has been published as \cite{Gorjanc}.

 \section{Introduction}

 Unknown or missing values can be represented in various ways. For example
-SAS uses \code{.} (dot), while \R{} uses \code{NA}, which we can read as
+SAS uses \code{.}~(dot), while \R{} uses \code{NA}, which we can read as
 Not Available. When we import data into \R{}, say via \code{read.table} or
 its derivatives, conversion of blank fields to \code{NA} (according to
 \code{read.table} help) is done for \code{logical}, \code{integer},
-\code{numeric} and \code{complex} classes. Additionally, \code{na.strings}
+\code{numeric} and \code{complex} classes. Additionally,
+the \code{na.strings}
 argument can be used to specify values that should also be converted to
-\code{NA}. Inversely there is an argument \code{na} in \code{write.table}
+\code{NA}. Inversely, there is an argument \code{na} in \code{write.table}
 and its derivatives to define value that will replace \code{NA} in exported
 data. There are also other ways to import/export data into \R{} as
-described in ``R Data Import/Export'' Manual \citep{RImportExportManual},
-however all approaches lack the possibility to define unknown value(s) for
-particular column. It is possible that unknown value in one column is a
-valid value in another column. For example I have seen many datasets where
-values as 0, -9, 999 and specific dates are used as column specific unknown
-values in one dataset.
+described in the {\emph R Data Import/Export} manual \citep{RImportExportManual}.
+However, all approaches lack the possibility to define unknown value(s) for
+some particular column. It is possible that an unknown value in one column is a
+valid value in another column. For example, I have seen many datasets where
+values such as 0, -9, 999 and specific dates are used as column specific unknown
+values.

-This note represents set of functions in \pkg{gdata}\footnote{from version
-  2.3.1} package \citep{WarnesGdata}: \code{isUnknown}, \code{unknownToNA}
-and \code{NAToUnknown}, which can help with testing for unknown values and
-conversions between unknown values and \code{NA}. All three functions are
-generic (S3) and were tested (at the time of writing) to work with:
-\code{integer}, \code{numeric}, \code{character}, \code{factor},
-\code{Date}, \code{POSIXct}, \code{POSIXlt}, \code{list}, \code{data.frame}
-and \code{matrix} classes.
+This note describes a set of functions in package \pkg{gdata}\footnote{
+package version 2.3.1} \citep{WarnesGdata}: \code{isUnknown},
+\code{unknownToNA} and \code{NAToUnknown}, which can help with testing for
+unknown values and conversions between unknown values and \code{NA}. All
+three functions are generic (S3) and were tested (at the time of writing)
+to work with: \code{integer}, \code{numeric}, \code{character},
+\code{factor}, \code{Date}, \code{POSIXct}, \code{POSIXlt}, \code{list},
+\code{data.frame} and \code{matrix} classes.

 \section{Description with examples}

-The following examples show simple usage of this functions on
+The following examples show simple usage of these functions on
 \code{numeric} and \code{factor} classes, where value \code{0} (beside
-\code{NA}) should be treated as unknown value:
+\code{NA}) should be treated as an unknown value:

 \begin{smallverbatim}
-> library(gdata)
+> library("gdata")
 > xNum <- c(0, 6, 0, 7, 8, 9, NA)
 > isUnknown(x=xNum)
-[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
+[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
 \end{smallverbatim}

 The default unknown value in \code{isUnknown} is \code{NA}, which means
-that output is the same to \code{is.na} - at least for atomic
-classes. However, we can pass argument \code{unknown} to define which
-values should be treated as unknown.
+that output is the same as \code{is.na} --- at least for atomic
+classes. However, we can pass the argument \code{unknown} to define which
+values should be treated as unknown:

 \begin{smallverbatim}
 > isUnknown(x=xNum, unknown=0)
-[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
+[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
 \end{smallverbatim}

-This skipped \code{NA}, but we can get expected answer after appropriately
-adding \code{NA} into argument \code{unknown}.
+This skipped \code{NA}, but we can get the expected answer after appropriately
+adding \code{NA} into the argument \code{unknown}:

 \begin{smallverbatim}
 > isUnknown(x=xNum, unknown=c(0, NA))
-[1] TRUE FALSE TRUE FALSE FALSE FALSE TRUE
+[1] TRUE FALSE TRUE FALSE FALSE FALSE TRUE
 \end{smallverbatim}

 Now, we can change all unknown values to \code{NA} with \code{unknownToNA}.
 There is clearly no need to add \code{NA} here. This step is very handy
-after importing data from external source, where many different unknown
+after importing data from an external source, where many different unknown
 values might be used. Argument \code{warning=TRUE} can be used, if there is
-a need to be warned about ``original'' \code{NA}s.
+a need to be warned about ``original'' \code{NA}s:

 \begin{smallverbatim}
 > xNum2 <- unknownToNA(x=xNum, unknown=0)
 [1] NA 6 NA 7 8 9 NA
 \end{smallverbatim}

-Prior to export from \R{}, we might want to change unknown value (\code{NA}
+Prior to export from \R{}, we might want to change unknown values (\code{NA}
 in \R{}) to some other value. Function \code{NAToUnknown} can be used for
-this.
+this:

 \begin{smallverbatim}
 > NAToUnknown(x=xNum2, unknown=999)
 [1] 999 6 999 7 8 9 999
 \end{smallverbatim}

-Converting \code{NA} to value that already exists in \code{x} issues an
-error, however \code{force=TRUE} can be used to overcome this if
-needed. But be warned that there is no way back from this step.
+Converting \code{NA} to a value that already exists in \code{x} issues an
+error, but \code{force=TRUE} can be used to overcome this if
+needed. But be warned that there is no way back from this step:

 \begin{smallverbatim}
 > NAToUnknown(x=xNum2, unknown=7, force=TRUE)
 [1] 7 6 7 7 8 9 7
 \end{smallverbatim}

-Examples bellow show all peculiarities with \code{factor} class.
+Examples below show all peculiarities with class \code{factor}.
 \code{unknownToNA} removes \code{unknown} value from levels and inversely
 \code{NAToUnknown} adds it with a warning. Additionally, \code{"NA"} is
-properly distinguished from \code{NA}. It can also be seen that argument
+properly distinguished from \code{NA}. It can also be seen that the argument
 \code{unknown} in functions \code{isUnknown} and \code{unknownToNA} need
-not match class of \code{x} (otherwise factor should be used) as the test
+not match the class of \code{x} (otherwise factor should be used) as the test
 is internally done with \code{\%in\%}, which nicely resolves coercing
 issues.
@@ -137,33 +138,34 @@
 new level is introduced: 0
 \end{smallverbatim}

-These two examples with numeric an factor classes are fairly simple and we
+These two examples with classes \code{numeric} and \code{factor} are fairly simple and we
 could get the same results with one or two lines of \R{} code. The real
-benefit of presented set of functions is in \code{list} and
+benefit of the set of functions presented here is in \code{list} and
 \code{data.frame} methods, where \code{data.frame} methods are merely
 wrappers for \code{list} methods.

 We need additional flexibility for \code{list}/\code{data.frame} methods,
-due to possibility of having multiple unknown values that can be different
+due to possibly having multiple unknown values that can be different
 among \code{list} components or \code{data.frame} columns. For these two
-methods, argument \code{unknown} can be either a \code{vector} or
+methods, the argument \code{unknown} can be either a \code{vector} or
 \code{list}, both possibly named. Of course, greater flexibility (defining
 multiple unknown values per component/column) can be achieved with
 a \code{list}.

-When \code{vector}/\code{list} passed to argument \code{unknown} is not
-named, first value/component of a \code{vector}/\code{list} matches first
+When a \code{vector}/\code{list} object passed to the argument \code{unknown} is not
+named, the first value/component of a \code{vector}/\code{list} matches the first
 component/column of a \code{list}/\code{data.frame}. This can be quite
-error prone, especially with \code{vectors}. Therefore, I encourage use of
+error prone, especially with \code{vectors}. Therefore, I encourage the use of
 a \code{list}. In case \code{vector}/\code{list} passed to argument
 \code{unknown} is named, names are matched to names of \code{list} or
 \code{data.frame}. If lengths of \code{unknown} and \code{list} or
 \code{data.frame} do not match, recycling occurs.

-Example bellow shows usage of described functions on a list, that is
+The example below illustrates the application of the
+described functions to a list which is
 composed of previously defined and modified numeric (\code{xNum}) and
-factor (\code{xFac}) classes. First function \code{isUnknown} is used with
-\code{0} as unknown value. Note that we get \code{FALSE} for \code{NA}s as
+factor (\code{xFac}) classes. First, function \code{isUnknown} is used with
+\code{0} as an unknown value. Note that we get \code{FALSE} for \code{NA}s as
 has been the case in the first example.

 \begin{smallverbatim}
@@ -176,33 +178,33 @@
 Levels: 0 BA NA RA
 > isUnknown(x=xList, unknown=0)
 $a
-[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
+[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE

 $b
-[1] TRUE FALSE FALSE FALSE TRUE FALSE
+[1] TRUE FALSE FALSE FALSE TRUE FALSE
 \end{smallverbatim}

-We need to add \code{NA} as unknown value. However, we do not get the
+We need to add \code{NA} as an unknown value. However, we do not get the
 expected result this way!

 \begin{smallverbatim}
 > isUnknown(x=xList, unknown=c(0, NA))
 $a
-[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
+[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE

 $b
 [1] FALSE FALSE FALSE FALSE FALSE FALSE
 \end{smallverbatim}

-This is due to matching of values in argument \code{unknown} and components
-in a \code{list} i.e. \code{0} is used for component \code{a} and \code{NA}
+This is due to matching of values in the argument \code{unknown} and components
+in a \code{list}; i.e., \code{0} is used for component \code{a} and \code{NA}
 for component \code{b}. Therefore, it is less error prone and more
-flexible to pass \code{list} (preferably named one) to argument
-\code{unknown} as shown bellow.
+flexible to pass a \code{list} (preferably a named list) to the argument
+\code{unknown}, as shown below.
\begin{smallverbatim} > xList1 <- unknownToNA(x=xList, -+ unknown=list(b=c(0, "NA"), a=0)) ++ unknown=list(b=c(0, "NA"), a=0)) $a [1] NA 6 NA 7 8 9 NA @@ -212,7 +214,7 @@ \end{smallverbatim} Changing \code{NA}s to some other value (only one per component/column) can -be now something like this: +be accomplished as follows: \begin{smallverbatim} > NAToUnknown(x=xList1, unknown=list(b="no", a=0)) @@ -227,11 +229,11 @@ new level is introduced: no \end{smallverbatim} -Named component \code{.default} of a \code{list} passed to argument -\code{unknown} has a special meaning as it will match component/column with +A named component \code{.default} of a \code{list} passed to argument +\code{unknown} has a special meaning as it will match a component/column with that name and any other not defined in \code{unknown}. As such it is very useful if the number of components/columns with the same unknown value(s) -is large. Imagine a wide \code{data.frame} named \code{df}. Now +is large. Consider a wide \code{data.frame} named \code{df}. Now \code{.default} can be used to define unknown value for several columns: \begin{smallverbatim} @@ -242,8 +244,8 @@ \end{smallverbatim} If there is a need to work only on some components/columns you can of -course ``skip'' columns with standard \R{} mechanisms i.e. -with subsetting \code{list} or \code{data.frame} objects. +course ``skip'' columns with standard \R{} mechanisms, i.e., +by subsetting \code{list} or \code{data.frame} objects: \begin{smallverbatim} > cols <- c("col1", "col2") @@ -255,7 +257,7 @@ \section{Summary} Functions \code{isUnknown}, \code{unknownToNA} and \code{NAToUnknown} -provide a nice interface to work with various representations of +provide a useful interface to work with various representations of unknown/missing values. Their use is meant primarily for shaping the data after importing to or before exporting from \R{}. I welcome any comments or suggestions. 
@@ -272,8 +274,8 @@ \bibitem[Gorjanc(2007)]{Gorjanc} G.~Gorjanc. \newblock Working with unknown values: the gdata package. -\newblock \emph{R News}, 1\penalty0 (1):\penalty0 ?--?, ? 2007. -\newblock URL \url{http://CRAN.R-project.org/doc/Rnews/Rnews_2007-?.pdf}. +\newblock \emph{R News}, 7\penalty0 (1):\penalty0 24--26, 2007. +\newblock URL \url{http://CRAN.R-project.org/doc/Rnews/Rnews_2007-1.pdf}. \bibitem[{R Development Core Team}(2006)]{RImportExportManual} {R Development Core Team}. @@ -281,7 +283,7 @@ \newblock URL \url{http://cran.r-project.org/manuals.html}. \newblock ISBN 3-900051-10-0. -\bibitem[Warnes.(2006)]{WarnesGdata} +\bibitem[Warnes (2006)]{WarnesGdata} G.~R. Warnes. \newblock \emph{gdata: Various R programming tools for data manipulation}, 2006. This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <gg...@us...> - 2007-08-20 10:00:43
Revision: 1149 http://r-gregmisc.svn.sourceforge.net/r-gregmisc/?rev=1149&view=rev Author: ggorjan Date: 2007-08-20 03:00:38 -0700 (Mon, 20 Aug 2007) Log Message: ----------- a real vignette Modified Paths: -------------- trunk/gdata/inst/doc/unknown.pdf Added Paths: ----------- trunk/gdata/inst/doc/unknown.Rnw Removed Paths: ------------- trunk/gdata/inst/doc/unknown.tex Added: trunk/gdata/inst/doc/unknown.Rnw =================================================================== --- trunk/gdata/inst/doc/unknown.Rnw (rev 0) +++ trunk/gdata/inst/doc/unknown.Rnw 2007-08-20 10:00:38 UTC (rev 1149) @@ -0,0 +1,272 @@ + +%\VignetteIndexEntry{Working with Unknown Values} +%\VignettePackage{gdata} +%\VignetteKeywords{unknown, missing, manip} + +\documentclass[a4paper]{report} +\usepackage{Rnews} +\usepackage[round]{natbib} +\bibliographystyle{abbrvnat} + +\usepackage{Sweave} +\SweaveOpts{strip.white=all, keep.source=TRUE} + +\begin{document} + +\begin{article} + +\title{Working with Unknown Values} +\subtitle{The \pkg{gdata} package} +\author{by Gregor Gorjanc} + +\maketitle + +This vignette has been published as \cite{Gorjanc}. + +\section{Introduction} + +Unknown or missing values can be represented in various ways. For example +SAS uses \code{.}~(dot), while \R{} uses \code{NA}, which we can read as +Not Available. When we import data into \R{}, say via \code{read.table} or +its derivatives, conversion of blank fields to \code{NA} (according to +\code{read.table} help) is done for \code{logical}, \code{integer}, +\code{numeric} and \code{complex} classes. Additionally, the +\code{na.strings} argument can be used to specify values that should also +be converted to \code{NA}. Inversely, there is an argument \code{na} in +\code{write.table} and its derivatives to define value that will replace +\code{NA} in exported data. There are also other ways to import/export data +into \R{} as described in the {\emph R Data Import/Export} manual +\citep{RImportExportManual}. 
However, all approaches lack the possibility +to define unknown value(s) for some particular column. It is possible that +an unknown value in one column is a valid value in another column. For +example, I have seen many datasets where values such as 0, -9, 999 and +specific dates are used as column specific unknown values. + +This note describes a set of functions in package \pkg{gdata}\footnote{ + package version 2.3.1} \citep{WarnesGdata}: \code{isUnknown}, +\code{unknownToNA} and \code{NAToUnknown}, which can help with testing for +unknown values and conversions between unknown values and \code{NA}. All +three functions are generic (S3) and were tested (at the time of writing) +to work with: \code{integer}, \code{numeric}, \code{character}, +\code{factor}, \code{Date}, \code{POSIXct}, \code{POSIXlt}, \code{list}, +\code{data.frame} and \code{matrix} classes. + +\section{Description with examples} + +The following examples show simple usage of these functions on +\code{numeric} and \code{factor} classes, where value \code{0} (beside +\code{NA}) should be treated as an unknown value: + +<<ex01>>= +library("gdata") +xNum <- c(0, 6, 0, 7, 8, 9, NA) +isUnknown(x=xNum) +@ + +The default unknown value in \code{isUnknown} is \code{NA}, which means +that output is the same as \code{is.na} --- at least for atomic +classes. However, we can pass the argument \code{unknown} to define which +values should be treated as unknown: + +<<ex02>>= +isUnknown(x=xNum, unknown=0) +@ + +This skipped \code{NA}, but we can get the expected answer after +appropriately adding \code{NA} into the argument \code{unknown}: + +<<ex03>>= +isUnknown(x=xNum, unknown=c(0, NA)) +@ + +Now, we can change all unknown values to \code{NA} with \code{unknownToNA}. +There is clearly no need to add \code{NA} here. This step is very handy +after importing data from an external source, where many different unknown +values might be used. 
Argument \code{warning=TRUE} can be used, if there is +a need to be warned about ``original'' \code{NA}s: + +<<ex04>>= +(xNum2 <- unknownToNA(x=xNum, unknown=0)) +@ + +Prior to export from \R{}, we might want to change unknown values +(\code{NA} in \R{}) to some other value. Function \code{NAToUnknown} can be +used for this: + +<<ex05>>= +NAToUnknown(x=xNum2, unknown=999) +@ + +Converting \code{NA} to a value that already exists in \code{x} issues an +error, but \code{force=TRUE} can be used to overcome this if needed. But be +warned that there is no way back from this step: + +<<ex06>>= +NAToUnknown(x=xNum2, unknown=7, force=TRUE) +@ + +Examples below show all peculiarities with class \code{factor}. +\code{unknownToNA} removes \code{unknown} value from levels and inversely +\code{NAToUnknown} adds it with a warning. Additionally, \code{"NA"} is +properly distinguished from \code{NA}. It can also be seen that the +argument \code{unknown} in functions \code{isUnknown} and +\code{unknownToNA} need not match the class of \code{x} (otherwise factor +should be used) as the test is internally done with \code{\%in\%}, which +nicely resolves coercing issues. + +<<ex07>>= +(xFac <- factor(c(0, "BA", "RA", "BA", NA, "NA"))) +isUnknown(x=xFac) +isUnknown(x=xFac, unknown=0) +isUnknown(x=xFac, unknown=c(0, NA)) +isUnknown(x=xFac, unknown=c(0, "NA")) +isUnknown(x=xFac, unknown=c(0, "NA", NA)) + +(xFac <- unknownToNA(x=xFac, unknown=0)) +(xFac <- NAToUnknown(x=xFac, unknown=0)) +@ + +These two examples with classes \code{numeric} and \code{factor} are fairly +simple and we could get the same results with one or two lines of \R{} +code. The real benefit of the set of functions presented here is in +\code{list} and \code{data.frame} methods, where \code{data.frame} methods +are merely wrappers for \code{list} methods. 
+ +We need additional flexibility for \code{list}/\code{data.frame} methods, +due to possibly having multiple unknown values that can be different among +\code{list} components or \code{data.frame} columns. For these two methods, +the argument \code{unknown} can be either a \code{vector} or \code{list}, +both possibly named. Of course, greater flexibility (defining multiple +unknown values per component/column) can be achieved with a \code{list}. + +When a \code{vector}/\code{list} object passed to the argument +\code{unknown} is not named, the first value/component of a +\code{vector}/\code{list} matches the first component/column of a +\code{list}/\code{data.frame}. This can be quite error prone, especially +with \code{vectors}. Therefore, I encourage the use of a \code{list}. In +case \code{vector}/\code{list} passed to argument \code{unknown} is named, +names are matched to names of \code{list} or \code{data.frame}. If lengths +of \code{unknown} and \code{list} or \code{data.frame} do not match, +recycling occurs. + +The example below illustrates the application of the described functions to +a list which is composed of previously defined and modified numeric +(\code{xNum}) and factor (\code{xFac}) classes. First, function +\code{isUnknown} is used with \code{0} as an unknown value. Note that we +get \code{FALSE} for \code{NA}s as has been the case in the first example. + +<<ex08>>= +(xList <- list(a=xNum, b=xFac)) +isUnknown(x=xList, unknown=0) +@ + +We need to add \code{NA} as an unknown value. However, we do not get the +expected result this way! + +<<ex09>>= +isUnknown(x=xList, unknown=c(0, NA)) +@ + +This is due to matching of values in the argument \code{unknown} and +components in a \code{list}; i.e., \code{0} is used for component \code{a} +and \code{NA} for component \code{b}. Therefore, it is less error prone +and more flexible to pass a \code{list} (preferably a named list) to the +argument \code{unknown}, as shown below. 
+ +<<ex10>>= +(xList1 <- unknownToNA(x=xList, + unknown=list(b=c(0, "NA"), + a=0))) +@ + +Changing \code{NA}s to some other value (only one per component/column) can +be accomplished as follows: + +<<ex11>>= +NAToUnknown(x=xList1, + unknown=list(b="no", a=0)) +@ + +A named component \code{.default} of a \code{list} passed to argument +\code{unknown} has a special meaning as it will match a component/column +with that name and any other not defined in \code{unknown}. As such it is +very useful if the number of components/columns with the same unknown +value(s) is large. Consider a wide \code{data.frame} named \code{df}. Now +\code{.default} can be used to define unknown value for several columns: + +<<ex12, echo=FALSE>>= +df <- data.frame(col1=c(0, 1, 999, 2), + col2=c("a", "b", "c", "unknown"), + col3=c(0, 1, 2, 3), + col4=c(0, 1, 2, 2)) +@ + +<<ex13>>= +tmp <- list(.default=0, + col1=999, + col2="unknown") +(df2 <- unknownToNA(x=df, + unknown=tmp)) +@ + +If there is a need to work only on some components/columns you can of +course ``skip'' columns with standard \R{} mechanisms, i.e., +by subsetting \code{list} or \code{data.frame} objects: + +<<ex14>>= +df2 <- df +cols <- c("col1", "col2") +tmp <- list(col1=999, + col2="unknown") +df2[, cols] <- unknownToNA(x=df[, cols], + unknown=tmp) +df2 +@ + +\section{Summary} + +Functions \code{isUnknown}, \code{unknownToNA} and \code{NAToUnknown} +provide a useful interface to work with various representations of +unknown/missing values. Their use is meant primarily for shaping the data +after importing to or before exporting from \R{}. I welcome any comments or +suggestions. + +% \bibliography{refs} + +\begin{thebibliography}{1} +\providecommand{\natexlab}[1]{#1} +\providecommand{\url}[1]{\texttt{#1}} +\expandafter\ifx\csname urlstyle\endcsname\relax + \providecommand{\doi}[1]{doi: #1}\else + \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi + +\bibitem[Gorjanc(2007)]{Gorjanc} +G.~Gorjanc. 
+\newblock Working with unknown values: the gdata package. +\newblock \emph{R News}, 7\penalty0 (1):\penalty0 24--26, 2007. +\newblock URL \url{http://CRAN.R-project.org/doc/Rnews/Rnews_2007-1.pdf}. + +\bibitem[{R Development Core Team}(2006)]{RImportExportManual} +{R Development Core Team}. +\newblock \emph{R Data Import/Export}, 2006. +\newblock URL \url{http://cran.r-project.org/manuals.html}. +\newblock ISBN 3-900051-10-0. + +\bibitem[Warnes (2006)]{WarnesGdata} +G.~R. Warnes. +\newblock \emph{gdata: Various R programming tools for data manipulation}, + 2006. +\newblock URL + \url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}. +\newblock R package version 2.3.1. Includes R source code and/or documentation + contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley. + +\end{thebibliography} + +\address{Gregor Gorjanc\\ + University of Ljubljana, Slovenia\\ +\email{gre...@bf...}} + +\end{article} + +\end{document} Property changes on: trunk/gdata/inst/doc/unknown.Rnw ___________________________________________________________________ Name: svn:keywords + Id Modified: trunk/gdata/inst/doc/unknown.pdf =================================================================== (Binary files differ) Deleted: trunk/gdata/inst/doc/unknown.tex =================================================================== --- trunk/gdata/inst/doc/unknown.tex 2007-08-20 08:18:26 UTC (rev 1148) +++ trunk/gdata/inst/doc/unknown.tex 2007-08-20 10:00:38 UTC (rev 1149) @@ -1,303 +0,0 @@ -\documentclass[a4paper]{report} -\usepackage{Rnews} -\usepackage[round]{natbib} -\bibliographystyle{abbrvnat} - -\begin{document} - -\begin{article} - -\title{Working with Unknown Values} -\subtitle{The \pkg{gdata} package} -\author{by Gregor Gorjanc} - -\maketitle - -This vignette has been published as \cite{Gorjanc}. - -\section{Introduction} - -Unknown or missing values can be represented in various ways. 
For example -SAS uses \code{.}~(dot), while \R{} uses \code{NA}, which we can read as -Not Available. When we import data into \R{}, say via \code{read.table} or -its derivatives, conversion of blank fields to \code{NA} (according to -\code{read.table} help) is done for \code{logical}, \code{integer}, -\code{numeric} and \code{complex} classes. Additionally, -the \code{na.strings} -argument can be used to specify values that should also be converted to -\code{NA}. Inversely, there is an argument \code{na} in \code{write.table} -and its derivatives to define value that will replace \code{NA} in exported -data. There are also other ways to import/export data into \R{} as -described in the {\emph R Data Import/Export} manual \citep{RImportExportManual}. -However, all approaches lack the possibility to define unknown value(s) for -some particular column. It is possible that an unknown value in one column is a -valid value in another column. For example, I have seen many datasets where -values such as 0, -9, 999 and specific dates are used as column specific unknown -values. - -This note describes a set of functions in package \pkg{gdata}\footnote{ -package version 2.3.1} \citep{WarnesGdata}: \code{isUnknown}, -\code{unknownToNA} and \code{NAToUnknown}, which can help with testing for -unknown values and conversions between unknown values and \code{NA}. All -three functions are generic (S3) and were tested (at the time of writing) -to work with: \code{integer}, \code{numeric}, \code{character}, -\code{factor}, \code{Date}, \code{POSIXct}, \code{POSIXlt}, \code{list}, -\code{data.frame} and \code{matrix} classes. 
- -\section{Description with examples} - -The following examples show simple usage of these functions on -\code{numeric} and \code{factor} classes, where value \code{0} (beside -\code{NA}) should be treated as an unknown value: - -\begin{smallverbatim} -> library("gdata") -> xNum <- c(0, 6, 0, 7, 8, 9, NA) -> isUnknown(x=xNum) -[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE -\end{smallverbatim} - -The default unknown value in \code{isUnknown} is \code{NA}, which means -that output is the same as \code{is.na} --- at least for atomic -classes. However, we can pass the argument \code{unknown} to define which -values should be treated as unknown: - -\begin{smallverbatim} -> isUnknown(x=xNum, unknown=0) -[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE -\end{smallverbatim} - -This skipped \code{NA}, but we can get the expected answer after appropriately -adding \code{NA} into the argument \code{unknown}: - -\begin{smallverbatim} -> isUnknown(x=xNum, unknown=c(0, NA)) -[1] TRUE FALSE TRUE FALSE FALSE FALSE TRUE -\end{smallverbatim} - -Now, we can change all unknown values to \code{NA} with \code{unknownToNA}. -There is clearly no need to add \code{NA} here. This step is very handy -after importing data from an external source, where many different unknown -values might be used. Argument \code{warning=TRUE} can be used, if there is -a need to be warned about ``original'' \code{NA}s: - -\begin{smallverbatim} -> xNum2 <- unknownToNA(x=xNum, unknown=0) -[1] NA 6 NA 7 8 9 NA -\end{smallverbatim} - -Prior to export from \R{}, we might want to change unknown values (\code{NA} -in \R{}) to some other value. Function \code{NAToUnknown} can be used for -this: - -\begin{smallverbatim} -> NAToUnknown(x=xNum2, unknown=999) -[1] 999 6 999 7 8 9 999 -\end{smallverbatim} - -Converting \code{NA} to a value that already exists in \code{x} issues an -error, but \code{force=TRUE} can be used to overcome this if -needed. 
But be warned that there is no way back from this step: - -\begin{smallverbatim} -> NAToUnknown(x=xNum2, unknown=7, force=TRUE) -[1] 7 6 7 7 8 9 7 -\end{smallverbatim} - -Examples below show all peculiarities with class \code{factor}. -\code{unknownToNA} removes \code{unknown} value from levels and inversely -\code{NAToUnknown} adds it with a warning. Additionally, \code{"NA"} is -properly distinguished from \code{NA}. It can also be seen that the argument -\code{unknown} in functions \code{isUnknown} and \code{unknownToNA} need -not match the class of \code{x} (otherwise factor should be used) as the test -is internally done with \code{\%in\%}, which nicely resolves coercing -issues. - -\begin{smallverbatim} -> xFac <- factor(c(0, "BA", "RA", "BA", NA, "NA")) -[1] 0 BA RA BA <NA> NA -Levels: 0 BA NA RA -> isUnknown(x=xFac) -[1] FALSE FALSE FALSE FALSE TRUE FALSE -> isUnknown(x=xFac, unknown=0) -[1] TRUE FALSE FALSE FALSE FALSE FALSE -> isUnknown(x=xFac, unknown=c(0, NA)) -[1] TRUE FALSE FALSE FALSE TRUE FALSE -> isUnknown(x=xFac, unknown=c(0, "NA")) -[1] TRUE FALSE FALSE FALSE FALSE TRUE -> isUnknown(x=xFac, unknown=c(0, "NA", NA)) -[1] TRUE FALSE FALSE FALSE TRUE TRUE - -> xFac <- unknownToNA(x=xFac, unknown=0) -[1] <NA> BA RA BA <NA> NA -Levels: BA NA RA -> xFac <- NAToUnknown(x=xFac, unknown=0) -[1] 0 BA RA BA 0 NA -Levels: 0 BA NA RA -Warning message: -new level is introduced: 0 -\end{smallverbatim} - -These two examples with classes \code{numeric} and \code{factor} are fairly simple and we -could get the same results with one or two lines of \R{} code. The real -benefit of the set of functions presented here is in \code{list} and -\code{data.frame} methods, where \code{data.frame} methods are merely -wrappers for \code{list} methods. - -We need additional flexibility for \code{list}/\code{data.frame} methods, -due to possibly having multiple unknown values that can be different -among \code{list} components or \code{data.frame} columns. 
For these two -methods, the argument \code{unknown} can be either a \code{vector} or -\code{list}, both possibly named. Of course, greater flexibility (defining -multiple unknown values per component/column) can be achieved with -a \code{list}. - -When a \code{vector}/\code{list} object passed to the argument \code{unknown} is not -named, the first value/component of a \code{vector}/\code{list} matches the first -component/column of a \code{list}/\code{data.frame}. This can be quite -error prone, especially with \code{vectors}. Therefore, I encourage the use of -a \code{list}. In case \code{vector}/\code{list} passed to argument -\code{unknown} is named, names are matched to names of \code{list} or -\code{data.frame}. If lengths of \code{unknown} and \code{list} or -\code{data.frame} do not match, recycling occurs. - -The example below illustrates the application of the -described functions to a list which is -composed of previously defined and modified numeric (\code{xNum}) and -factor (\code{xFac}) classes. First, function \code{isUnknown} is used with -\code{0} as an unknown value. Note that we get \code{FALSE} for \code{NA}s as -has been the case in the first example. - -\begin{smallverbatim} -> xList <- list(a=xNum, b=xFac) -$a -[1] 0 6 0 7 8 9 NA - -$b -[1] 0 BA RA BA 0 NA -Levels: 0 BA NA RA -> isUnknown(x=xList, unknown=0) -$a -[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE - -$b -[1] TRUE FALSE FALSE FALSE TRUE FALSE -\end{smallverbatim} - -We need to add \code{NA} as an unknown value. However, we do not get the -expected result this way! - -\begin{smallverbatim} -> isUnknown(x=xList, unknown=c(0, NA)) -$a -[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE - -$b -[1] FALSE FALSE FALSE FALSE FALSE FALSE -\end{smallverbatim} - -This is due to matching of values in the argument \code{unknown} and components -in a \code{list}; i.e., \code{0} is used for component \code{a} and \code{NA} -for component \code{b}. 
Therefore, it is less error prone and more -flexible to pass a \code{list} (preferably a named list) to the argument -\code{unknown}, as shown below. - -\begin{smallverbatim} -> xList1 <- unknownToNA(x=xList, -+ unknown=list(b=c(0, "NA"), a=0)) -$a -[1] NA 6 NA 7 8 9 NA - -$b -[1] <NA> BA RA BA <NA> <NA> -Levels: BA RA -\end{smallverbatim} - -Changing \code{NA}s to some other value (only one per component/column) can -be accomplished as follows: - -\begin{smallverbatim} -> NAToUnknown(x=xList1, unknown=list(b="no", a=0)) -$a -[1] 0 6 0 7 8 9 0 - -$b -[1] no BA RA BA no no -Levels: BA no RA - -Warning message: -new level is introduced: no -\end{smallverbatim} - -A named component \code{.default} of a \code{list} passed to argument -\code{unknown} has a special meaning as it will match a component/column with -that name and any other not defined in \code{unknown}. As such it is very -useful if the number of components/columns with the same unknown value(s) -is large. Consider a wide \code{data.frame} named \code{df}. Now -\code{.default} can be used to define unknown value for several columns: - -\begin{smallverbatim} -> df <- unknownToNA(x=df, -+ unknown=(.default=0, -+ col1=999, -+ col2="unknown")) -\end{smallverbatim} - -If there is a need to work only on some components/columns you can of -course ``skip'' columns with standard \R{} mechanisms, i.e., -by subsetting \code{list} or \code{data.frame} objects: - -\begin{smallverbatim} -> cols <- c("col1", "col2") -> df[, cols] <- unknownToNA(x=df[, cols], -+ unknown=(col1=999, -+ col2="unknown")) -\end{smallverbatim} - -\section{Summary} - -Functions \code{isUnknown}, \code{unknownToNA} and \code{NAToUnknown} -provide a useful interface to work with various representations of -unknown/missing values. Their use is meant primarily for shaping the data -after importing to or before exporting from \R{}. I welcome any comments or -suggestions. 
- -% \bibliography{refs} - -\begin{thebibliography}{1} -\providecommand{\natexlab}[1]{#1} -\providecommand{\url}[1]{\texttt{#1}} -\expandafter\ifx\csname urlstyle\endcsname\relax - \providecommand{\doi}[1]{doi: #1}\else - \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi - -\bibitem[Gorjanc(2007)]{Gorjanc} -G.~Gorjanc. -\newblock Working with unknown values: the gdata package. -\newblock \emph{R News}, 7\penalty0 (1):\penalty0 24--26, 2007. -\newblock URL \url{http://CRAN.R-project.org/doc/Rnews/Rnews_2007-1.pdf}. - -\bibitem[{R Development Core Team}(2006)]{RImportExportManual} -{R Development Core Team}. -\newblock \emph{R Data Import/Export}, 2006. -\newblock URL \url{http://cran.r-project.org/manuals.html}. -\newblock ISBN 3-900051-10-0. - -\bibitem[Warnes (2006)]{WarnesGdata} -G.~R. Warnes. -\newblock \emph{gdata: Various R programming tools for data manipulation}, - 2006. -\newblock URL - \url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}. -\newblock R package version 2.3.1. Includes R source code and/or documentation - contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley. - -\end{thebibliography} - -\address{Gregor Gorjanc\\ - University of Ljubljana, Slovenia\\ -\email{gre...@bf...}} - -\end{article} - -\end{document}
From: <gg...@us...> - 2007-08-20 10:17:41
Revision: 1150 http://r-gregmisc.svn.sourceforge.net/r-gregmisc/?rev=1150&view=rev Author: ggorjan Date: 2007-08-20 03:17:37 -0700 (Mon, 20 Aug 2007) Log Message: ----------- a real vignette Modified Paths: -------------- trunk/gdata/inst/doc/mapLevels.pdf trunk/gdata/inst/doc/mapLevels.tex Added Paths: ----------- trunk/gdata/inst/doc/mapLevels.Rnw Added: trunk/gdata/inst/doc/mapLevels.Rnw =================================================================== --- trunk/gdata/inst/doc/mapLevels.Rnw (rev 0) +++ trunk/gdata/inst/doc/mapLevels.Rnw 2007-08-20 10:17:37 UTC (rev 1150) @@ -0,0 +1,229 @@ + +%\VignetteIndexEntry{Mapping levels of a factor} +%\VignettePackage{gdata} +%\VignetteKeywords{levels, factor, manip} + +\documentclass[a4paper]{report} +\usepackage{Rnews} +\usepackage[round]{natbib} +\bibliographystyle{abbrvnat} + +\usepackage{Sweave} +\SweaveOpts{strip.white=all, keep.source=TRUE} + +\begin{document} + +\begin{article} + +\title{Mapping levels of a factor} +\subtitle{The \pkg{gdata} package} +\author{by Gregor Gorjanc} + +\maketitle + +\section{Introduction} + +Factors use levels attribute to store information on mapping between +internal integer codes and character values i.e. levels. First level is +mapped to internal integer code 1 and so on. Although some users do not +like factors, their use is more efficient in terms of storage than for +character vectors. Additionally, there are many functions in base \R{} that +provide additional value for factors. Sometimes users need to work with +internal integer codes and mapping them back to factor, especially when +interfacing external programs. Mapping information is also of interest if +there are many factors that should have the same set of levels. This note +describes \code{mapLevels} function, which is an utility function for +mapping the levels of a factor in \pkg{gdata} \footnote{from version 2.3.1} +package \citep{WarnesGdata}. 
+ +\section{Description with examples} + +Function \code{mapLevels()} is an (S3) generic function and works on +\code{factor} and \code{character} atomic classes. It also works on +\code{list} and \code{data.frame} objects with previously mentioned atomic +classes. Function \code{mapLevels} produces a so called ``map'' with names +and values. Names are levels, while values can be internal integer codes or +(possibly other) levels. This will be clarified later on. Class of this +``map'' is \code{levelsMap}, if \code{x} in \code{mapLevels()} was atomic +or \code{listLevelsMap} otherwise - for \code{list} and \code{data.frame} +classes. The following example shows the creation and printout of such a +``map''. + +<<ex01>>= +library(gdata) +(fac <- factor(c("B", "A", "Z", "D"))) +(map <- mapLevels(x=fac)) +@ + +If we have to work with internal integer codes, we can transform factor to +integer and still get ``back the original factor'' with ``map'' used as +argument in \code{mapLevels<-} function as shown bellow. \code{mapLevels<-} +is also an (S3) generic function and works on same classes as +\code{mapLevels} plus \code{integer} atomic class. + +<<ex02>>= +(int <- as.integer(fac)) +mapLevels(x=int) <- map +int +identical(fac, int) +@ + +Internally ``map'' (\code{levelsMap} class) is a \code{list} (see bellow), +but its print method unlists it for ease of inspection. ``Map'' from +example has all components of length 1. This is not mandatory as +\code{mapLevels<-} function is only a wrapper around workhorse function +\code{levels<-} and the later can accept \code{list} with components of +various lengths. + +<<ex03>>= +str(map) +@ + +Although not of primary importance, this ``map'' can also be used to remap +factor levels as shown bellow. Components ``later'' in the map take over +the ``previous'' ones. 
Since this is not optimal I would rather recommend +other approaches for ``remapping'' the levels of a \code{factor}, say +\code{recode} in \pkg{car} package \citep{FoxCar}. + +<<ex04>>= +map[[2]] <- as.integer(c(1, 2)) +map +int <- as.integer(fac) +mapLevels(x=int) <- map +int +@ + +Up to now examples showed ``map'' with internal integer codes for values +and levels for names. I call this integer ``map''. On the other hand +character ``map'' uses levels for values and (possibly other) levels for +names. This feature is a bit odd at first sight, but can be used to easily +unify levels and internal integer codes across several factors. Imagine +you have a factor that is for some reason split into two factors \code{f1} +and \code{f2} and that each factor does not have all levels. This is not +uncommon situation. + +<<ex05>>= +(f1 <- factor(c("A", "D", "C"))) +(f2 <- factor(c("B", "D", "C"))) +@ + +If we work with this factors, we need to be careful as they do not have the +same set of levels. This can be solved with appropriately specifying +\code{levels} argument in creation of factors i.e. \code{levels=c("A", "B", + "C", "D")} or with proper use of \code{levels<-} function. I say proper +as it is very tempting to use: + +<<ex06>>= +fTest <- f1 +levels(fTest) <- c("A", "B", "C", "D") +fTest +@ + +Above example extends set of levels, but also changes level of 2nd and 3rd +element in \code{fTest}! Proper use of \code{levels<-} (as shown in +\code{levels} help page) would be: + +<<ex07>>= +fTest <- f1 +levels(fTest) <- list(A="A", B="B", + C="C", D="D") +fTest +@ + +Function \code{mapLevels} with character ``map'' can help us in such +scenarios to unify levels and internal integer codes across several +factors. Again the workhorse under this process is \code{levels<-} function +from base \R{}! Function \code{mapLevels<-} just controls the assignment of +(integer or character) ``map'' to \code{x}. 
Levels in \code{x} that match +``map'' values (internal integer codes or levels) are changed to ``map'' +names (possibly other levels) as shown in \code{levels} help page. Levels +that do not match are converted to \code{NA}. Integer ``map'' can be +applied to \code{integer} or \code{factor}, while character ``map'' can be +applied to \code{character} or \code{factor}. Result of \code{mapLevels<-} +is always a \code{factor} with possibly ``remapped'' levels. + +To get one joint character ``map'' for several factors, we need to put +factors in a \code{list} or \code{data.frame} and use arguments +\code{codes=FALSE} and \code{combine=TRUE}. Such map can then be used to +unify levels and internal integer codes. + +<<ex08>>= +(bigMap <- mapLevels(x=list(f1, f2), + codes=FALSE, + combine=TRUE)) +mapLevels(f1) <- bigMap +mapLevels(f2) <- bigMap +f1 +f2 +cbind(as.character(f1), as.integer(f1), + as.character(f2), as.integer(f2)) +@ + +If we do not specify \code{combine=TRUE} (which is the default behaviour) +and \code{x} is a \code{list} or \code{data.frame}, \code{mapLevels} +returns ``map'' of class \code{listLevelsMap}. This is internally a +\code{list} of ``maps'' (\code{levelsMap} objects). Both +\code{listLevelsMap} and \code{levelsMap} objects can be passed to +\code{mapLevels<-} for \code{list}/\code{data.frame}. Recycling occurs when +length of \code{listLevelsMap} is not the same as number of +components/columns of a \code{list}/\code{data.frame}. 
+ +Additional convenience methods are also implemented to ease the work with +``maps'': + +\begin{itemize} + +\item \code{is.levelsMap}, \code{is.listLevelsMap}, \code{as.levelsMap} and + \code{as.listLevelsMap} for testing and coercion of user defined + ``maps'', + +\item \code{"["} for subsetting, + +\item \code{c} for combining \code{levelsMap} or \code{listLevelsMap} + objects; argument \code{recursive=TRUE} can be used to coerce + \code{listLevelsMap} to \code{levelsMap}, for example \code{c(llm1, llm2, + recursive=TRUE)} and + +\item \code{unique} and \code{sort} for \code{levelsMap}. + +\end{itemize} + +\section{Summary} + +Functions \code{mapLevels} and \code{mapLevels<-} can help users to map +internal integer codes to factor levels and unify levels as well as +internal integer codes among several factors. I welcome any comments or +suggestions. + +% \bibliography{refs} +\begin{thebibliography}{1} +\providecommand{\natexlab}[1]{#1} +\providecommand{\url}[1]{\texttt{#1}} +\expandafter\ifx\csname urlstyle\endcsname\relax + \providecommand{\doi}[1]{doi: #1}\else + \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi + +\bibitem[Fox(2006)]{FoxCar} +J.~Fox. +\newblock \emph{car: Companion to Applied Regression}, 2006. +\newblock URL \url{http://socserv.socsci.mcmaster.ca/jfox/}. +\newblock R package version 1.1-1. + +\bibitem[Warnes(2006)]{WarnesGdata} +G.~R. Warnes. +\newblock \emph{gdata: Various R programming tools for data manipulation}, + 2006. +\newblock URL + \url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}. +\newblock R package version 2.3.1. Includes R source code and/or documentation + contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley. 
+ +\end{thebibliography} + +\address{Gregor Gorjanc\\ + University of Ljubljana, Slovenia\\ +\email{gre...@bf...}} + +\end{article} + +\end{document} Property changes on: trunk/gdata/inst/doc/mapLevels.Rnw ___________________________________________________________________ Name: svn:keywords + Id Modified: trunk/gdata/inst/doc/mapLevels.pdf =================================================================== (Binary files differ) Modified: trunk/gdata/inst/doc/mapLevels.tex =================================================================== --- trunk/gdata/inst/doc/mapLevels.tex 2007-08-20 10:00:38 UTC (rev 1149) +++ trunk/gdata/inst/doc/mapLevels.tex 2007-08-20 10:17:37 UTC (rev 1150) @@ -1,3 +1,8 @@ + +%\VignetteIndexEntry{Mapping levels of a factor} +%\VignettePackage{gdata} +%\VignetteKeywords{levels, factor, manip} + \documentclass[a4paper]{report} \usepackage{Rnews} \usepackage[round]{natbib} @@ -3,10 +8,13 @@ \bibliographystyle{abbrvnat} +\usepackage{Sweave} + + \begin{document} \begin{article} \title{Mapping levels of a factor} -\subtitle{The gdata package} +\subtitle{The \pkg{gdata} package} \author{by Gregor Gorjanc} @@ -41,15 +49,23 @@ classes. The following example shows the creation and printout of such a ``map''. -\begin{smallverbatim} +\begin{Schunk} +\begin{Sinput} > library(gdata) > (fac <- factor(c("B", "A", "Z", "D"))) +\end{Sinput} +\begin{Soutput} [1] B A Z D Levels: A B D Z +\end{Soutput} +\begin{Sinput} > (map <- mapLevels(x=fac)) -A B D Z -1 2 3 4 -\end{smallverbatim} +\end{Sinput} +\begin{Soutput} +A B D Z +1 2 3 4 +\end{Soutput} +\end{Schunk} If we have to work with internal integer codes, we can transform factor to integer and still get ``back the original factor'' with ``map'' used as @@ -57,16 +73,28 @@ is also an (S3) generic function and works on same classes as \code{mapLevels} plus \code{integer} atomic class. 
-\begin{smallverbatim} +\begin{Schunk} +\begin{Sinput} > (int <- as.integer(fac)) +\end{Sinput} +\begin{Soutput} [1] 2 1 4 3 +\end{Soutput} +\begin{Sinput} > mapLevels(x=int) <- map > int +\end{Sinput} +\begin{Soutput} [1] B A Z D Levels: A B D Z +\end{Soutput} +\begin{Sinput} > identical(fac, int) +\end{Sinput} +\begin{Soutput} [1] TRUE -\end{smallverbatim} +\end{Soutput} +\end{Schunk} Internally ``map'' (\code{levelsMap} class) is a \code{list} (see bellow), but its print method unlists it for ease of inspection. ``Map'' from @@ -75,15 +103,19 @@ \code{levels<-} and the later can accept \code{list} with components of various lengths. -\begin{smallverbatim} +\begin{Schunk} +\begin{Sinput} > str(map) +\end{Sinput} +\begin{Soutput} List of 4 $ A: int 1 $ B: int 2 $ D: int 3 $ Z: int 4 - attr(*, "class")= chr "levelsMap" -\end{smallverbatim} +\end{Soutput} +\end{Schunk} Although not of primary importance, this ``map'' can also be used to remap factor levels as shown bellow. Components ``later'' in the map take over @@ -91,17 +123,25 @@ other approaches for ``remapping'' the levels of a \code{factor}, say \code{recode} in \pkg{car} package \citep{FoxCar}. -\begin{smallverbatim} +\begin{Schunk} +\begin{Sinput} > map[[2]] <- as.integer(c(1, 2)) > map -A B B D Z -1 1 2 3 4 +\end{Sinput} +\begin{Soutput} +A B B D Z +1 1 2 3 4 +\end{Soutput} +\begin{Sinput} > int <- as.integer(fac) > mapLevels(x=int) <- map > int +\end{Sinput} +\begin{Soutput} [1] B B Z D Levels: A B D Z -\end{smallverbatim} +\end{Soutput} +\end{Schunk} Up to now examples showed ``map'' with internal integer codes for values and levels for names. I call this integer ``map''. On the other hand @@ -112,14 +152,22 @@ and \code{f2} and that each factor does not have all levels. This is not uncommon situation. 
-\begin{smallverbatim} +\begin{Schunk} +\begin{Sinput} > (f1 <- factor(c("A", "D", "C"))) +\end{Sinput} +\begin{Soutput} [1] A D C Levels: A C D +\end{Soutput} +\begin{Sinput} > (f2 <- factor(c("B", "D", "C"))) +\end{Sinput} +\begin{Soutput} [1] B D C Levels: B C D -\end{smallverbatim} +\end{Soutput} +\end{Schunk} If we work with this factors, we need to be careful as they do not have the same set of levels. This can be solved with appropriately specifying @@ -127,25 +175,34 @@ "C", "D")} or with proper use of \code{levels<-} function. I say proper as it is very tempting to use: -\begin{smallverbatim} +\begin{Schunk} +\begin{Sinput} > fTest <- f1 > levels(fTest) <- c("A", "B", "C", "D") > fTest +\end{Sinput} +\begin{Soutput} [1] A C B Levels: A B C D -\end{smallverbatim} +\end{Soutput} +\end{Schunk} Above example extends set of levels, but also changes level of 2nd and 3rd element in \code{fTest}! Proper use of \code{levels<-} (as shown in \code{levels} help page) would be: -\begin{smallverbatim} +\begin{Schunk} +\begin{Sinput} > fTest <- f1 -> levels(fTest) <- list(A="A", B="B", C="C", D="D") +> levels(fTest) <- list(A="A", B="B", ++ C="C", D="D") > fTest +\end{Sinput} +\begin{Soutput} [1] A D C Levels: A B C D -\end{smallverbatim} +\end{Soutput} +\end{Schunk} Function \code{mapLevels} with character ``map'' can help us in such scenarios to unify levels and internal integer codes across several @@ -159,31 +216,48 @@ applied to \code{character} or \code{factor}. Result of \code{mapLevels<-} is always a \code{factor} with possibly ``remapped'' levels. -To get one joint character ``map'' for several factors, we need to -put factors in a \code{list} or \code{data.frame} and use arguments +To get one joint character ``map'' for several factors, we need to put +factors in a \code{list} or \code{data.frame} and use arguments \code{codes=FALSE} and \code{combine=TRUE}. Such map can then be used to unify levels and internal integer codes. 
-\begin{smallverbatim} -> (bigMap <- mapLevels(x=list(f1, f2), codes=FALSE, +\begin{Schunk} +\begin{Sinput} +> (bigMap <- mapLevels(x=list(f1, f2), ++ codes=FALSE, + combine=TRUE)) - A B C D -"A" "B" "C" "D" +\end{Sinput} +\begin{Soutput} + A B C D +"A" "B" "C" "D" +\end{Soutput} +\begin{Sinput} > mapLevels(f1) <- bigMap > mapLevels(f2) <- bigMap > f1 +\end{Sinput} +\begin{Soutput} [1] A D C Levels: A B C D +\end{Soutput} +\begin{Sinput} > f2 +\end{Sinput} +\begin{Soutput} [1] B D C Levels: A B C D +\end{Soutput} +\begin{Sinput} > cbind(as.character(f1), as.integer(f1), + as.character(f2), as.integer(f2)) +\end{Sinput} +\begin{Soutput} [,1] [,2] [,3] [,4] -[1,] "A" "1" "B" "2" -[2,] "D" "4" "D" "4" -[3,] "C" "3" "C" "3" -\end{smallverbatim} +[1,] "A" "1" "B" "2" +[2,] "D" "4" "D" "4" +[3,] "C" "3" "C" "3" +\end{Soutput} +\end{Schunk} If we do not specify \code{combine=TRUE} (which is the default behaviour) and \code{x} is a \code{list} or \code{data.frame}, \code{mapLevels} @@ -235,7 +309,7 @@ \newblock URL \url{http://socserv.socsci.mcmaster.ca/jfox/}. \newblock R package version 1.1-1. -\bibitem[Warnes.(2006)]{WarnesGdata} +\bibitem[Warnes(2006)]{WarnesGdata} G.~R. Warnes. \newblock \emph{gdata: Various R programming tools for data manipulation}, 2006.