[R-gregmisc-users] SF.net SVN: r-gregmisc: [1097] trunk/gdata/inst/doc
From: <gg...@us...> - 2007-06-06 10:24:17
Revision: 1097
          http://svn.sourceforge.net/r-gregmisc/?rev=1097&view=rev
Author:   ggorjan
Date:     2007-06-06 03:24:15 -0700 (Wed, 06 Jun 2007)

Log Message:
-----------
last edits from newsletter

Modified Paths:
--------------
    trunk/gdata/inst/doc/unknown.pdf
    trunk/gdata/inst/doc/unknown.tex

Modified: trunk/gdata/inst/doc/unknown.pdf
===================================================================
(Binary files differ)

Modified: trunk/gdata/inst/doc/unknown.tex
===================================================================
--- trunk/gdata/inst/doc/unknown.tex 2007-06-06 10:19:15 UTC (rev 1096)
+++ trunk/gdata/inst/doc/unknown.tex 2007-06-06 10:24:15 UTC (rev 1097)
@@ -7,108 +7,109 @@

 \begin{article}

-\title{Working with unknown values}
-\subtitle{The gdata package}
+\title{Working with Unknown Values}
+\subtitle{The \pkg{gdata} package}
 \author{by Gregor Gorjanc}

 \maketitle

-Published as \cite{Gorjanc}.
+This vignette has been published as \cite{Gorjanc}.

 \section{Introduction}

 Unknown or missing values can be represented in various ways. For example
-SAS uses \code{.} (dot), while \R{} uses \code{NA}, which we can read as
+SAS uses \code{.}~(dot), while \R{} uses \code{NA}, which we can read as
 Not Available. When we import data into \R{}, say via \code{read.table} or
 its derivatives, conversion of blank fields to \code{NA} (according to
 \code{read.table} help) is done for \code{logical}, \code{integer},
-\code{numeric} and \code{complex} classes. Additionally, \code{na.strings}
+\code{numeric} and \code{complex} classes. Additionally,
+the \code{na.strings}
 argument can be used to specify values that should also be converted to
-\code{NA}. Inversely there is an argument \code{na} in \code{write.table}
+\code{NA}. Inversely, there is an argument \code{na} in \code{write.table}
 and its derivatives to define value that will replace \code{NA} in exported
 data.
 There are also other ways to import/export data into \R{} as
-described in ``R Data Import/Export'' Manual \citep{RImportExportManual},
-however all approaches lack the possibility to define unknown value(s) for
-particular column. It is possible that unknown value in one column is a
-valid value in another column. For example I have seen many datasets where
-values as 0, -9, 999 and specific dates are used as column specific unknown
-values in one dataset.
+described in the {\emph R Data Import/Export} manual \citep{RImportExportManual}.
+However, all approaches lack the possibility to define unknown value(s) for
+some particular column. It is possible that an unknown value in one column is a
+valid value in another column. For example, I have seen many datasets where
+values such as 0, -9, 999 and specific dates are used as column specific unknown
+values.

-This note represents set of functions in \pkg{gdata}\footnote{from version
-  2.3.1} package \citep{WarnesGdata}: \code{isUnknown}, \code{unknownToNA}
-and \code{NAToUnknown}, which can help with testing for unknown values and
-conversions between unknown values and \code{NA}. All three functions are
-generic (S3) and were tested (at the time of writing) to work with:
-\code{integer}, \code{numeric}, \code{character}, \code{factor},
-\code{Date}, \code{POSIXct}, \code{POSIXlt}, \code{list}, \code{data.frame}
-and \code{matrix} classes.
+This note describes a set of functions in package \pkg{gdata}\footnote{
+package version 2.3.1} \citep{WarnesGdata}: \code{isUnknown},
+\code{unknownToNA} and \code{NAToUnknown}, which can help with testing for
+unknown values and conversions between unknown values and \code{NA}. All
+three functions are generic (S3) and were tested (at the time of writing)
+to work with: \code{integer}, \code{numeric}, \code{character},
+\code{factor}, \code{Date}, \code{POSIXct}, \code{POSIXlt}, \code{list},
+\code{data.frame} and \code{matrix} classes.
 \section{Description with examples}

-The following examples show simple usage of this functions on
+The following examples show simple usage of these functions on
 \code{numeric} and \code{factor} classes, where value \code{0} (beside
-\code{NA}) should be treated as unknown value:
+\code{NA}) should be treated as an unknown value:

 \begin{smallverbatim}
-> library(gdata)
+> library("gdata")
 > xNum <- c(0, 6, 0, 7, 8, 9, NA)
 > isUnknown(x=xNum)
-[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
+[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
 \end{smallverbatim}

 The default unknown value in \code{isUnknown} is \code{NA}, which means
-that output is the same to \code{is.na} - at least for atomic
-classes. However, we can pass argument \code{unknown} to define which
-values should be treated as unknown.
+that output is the same as \code{is.na} --- at least for atomic
+classes. However, we can pass the argument \code{unknown} to define which
+values should be treated as unknown:

 \begin{smallverbatim}
 > isUnknown(x=xNum, unknown=0)
-[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
+[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
 \end{smallverbatim}

-This skipped \code{NA}, but we can get expected answer after appropriately
-adding \code{NA} into argument \code{unknown}.
+This skipped \code{NA}, but we can get the expected answer after appropriately
+adding \code{NA} into the argument \code{unknown}:

 \begin{smallverbatim}
 > isUnknown(x=xNum, unknown=c(0, NA))
-[1] TRUE FALSE TRUE FALSE FALSE FALSE TRUE
+[1] TRUE FALSE TRUE FALSE FALSE FALSE TRUE
 \end{smallverbatim}

 Now, we can change all unknown values to \code{NA} with
 \code{unknownToNA}. There is clearly no need to add \code{NA} here. This step is very handy
-after importing data from external source, where many different unknown
+after importing data from an external source, where many different unknown
 values might be used. Argument \code{warning=TRUE} can be used, if there is
-a need to be warned about ``original'' \code{NA}s.
+a need to be warned about ``original'' \code{NA}s:

 \begin{smallverbatim}
 > xNum2 <- unknownToNA(x=xNum, unknown=0)
 [1] NA 6 NA 7 8 9 NA
 \end{smallverbatim}

-Prior to export from \R{}, we might want to change unknown value (\code{NA}
+Prior to export from \R{}, we might want to change unknown values (\code{NA}
 in \R{}) to some other value. Function \code{NAToUnknown} can be used for
-this.
+this:

 \begin{smallverbatim}
 > NAToUnknown(x=xNum2, unknown=999)
 [1] 999 6 999 7 8 9 999
 \end{smallverbatim}

-Converting \code{NA} to value that already exists in \code{x} issues an
-error, however \code{force=TRUE} can be used to overcome this if
-needed. But be warned that there is no way back from this step.
+Converting \code{NA} to a value that already exists in \code{x} issues an
+error, but \code{force=TRUE} can be used to overcome this if
+needed. But be warned that there is no way back from this step:

 \begin{smallverbatim}
 > NAToUnknown(x=xNum2, unknown=7, force=TRUE)
 [1] 7 6 7 7 8 9 7
 \end{smallverbatim}

-Examples bellow show all peculiarities with \code{factor} class.
+Examples below show all peculiarities with class \code{factor}.
 \code{unknownToNA} removes \code{unknown} value from levels and inversely
 \code{NAToUnknown} adds it with a warning. Additionally, \code{"NA"} is
-properly distinguished from \code{NA}. It can also be seen that argument
+properly distinguished from \code{NA}. It can also be seen that the argument
 \code{unknown} in functions \code{isUnknown} and \code{unknownToNA} need
-not match class of \code{x} (otherwise factor should be used) as the test
+not match the class of \code{x} (otherwise factor should be used) as the test
 is internally done with \code{\%in\%}, which nicely resolves coercing
 issues.
@@ -137,33 +138,34 @@
 new level is introduced: 0
 \end{smallverbatim}

-These two examples with numeric an factor classes are fairly simple and we
+These two examples with classes \code{numeric} and \code{factor} are fairly simple and we
 could get the same results with one or two lines of \R{} code. The real
-benefit of presented set of functions is in \code{list} and
+benefit of the set of functions presented here is in \code{list} and
 \code{data.frame} methods, where \code{data.frame} methods are merely
 wrappers for \code{list} methods.

 We need additional flexibility for \code{list}/\code{data.frame} methods,
-due to possibility of having multiple unknown values that can be different
+due to possibly having multiple unknown values that can be different
 among \code{list} components or \code{data.frame} columns. For these two
-methods, argument \code{unknown} can be either a \code{vector} or
+methods, the argument \code{unknown} can be either a \code{vector} or
 \code{list}, both possibly named. Of course, greater flexibility (defining
 multiple unknown values per component/column) can be achieved with a
 \code{list}.

-When \code{vector}/\code{list} passed to argument \code{unknown} is not
-named, first value/component of a \code{vector}/\code{list} matches first
+When a \code{vector}/\code{list} object passed to the argument \code{unknown} is not
+named, the first value/component of a \code{vector}/\code{list} matches the first
 component/column of a \code{list}/\code{data.frame}. This can be quite
-error prone, especially with \code{vectors}. Therefore, I encourage use of
+error prone, especially with \code{vectors}. Therefore, I encourage the use of
 a \code{list}. In case \code{vector}/\code{list} passed to argument
 \code{unknown} is named, names are matched to names of \code{list} or
 \code{data.frame}. If lengths of \code{unknown} and \code{list} or
 \code{data.frame} do not match, recycling occurs.
-Example bellow shows usage of described functions on a list, that is
+The example below illustrates the application of the
+described functions to a list which is
 composed of previously defined and modified numeric (\code{xNum}) and
-factor (\code{xFac}) classes. First function \code{isUnknown} is used with
-\code{0} as unknown value. Note that we get \code{FALSE} for \code{NA}s as
+factor (\code{xFac}) classes. First, function \code{isUnknown} is used with
+\code{0} as an unknown value. Note that we get \code{FALSE} for \code{NA}s as
 has been the case in the first example.

 \begin{smallverbatim}
@@ -176,33 +178,33 @@
 Levels: 0 BA NA RA
 > isUnknown(x=xList, unknown=0)
 $a
-[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
+[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE

 $b
-[1] TRUE FALSE FALSE FALSE TRUE FALSE
+[1] TRUE FALSE FALSE FALSE TRUE FALSE
 \end{smallverbatim}

-We need to add \code{NA} as unknown value. However, we do not get the
+We need to add \code{NA} as an unknown value. However, we do not get the
 expected result this way!

 \begin{smallverbatim}
 > isUnknown(x=xList, unknown=c(0, NA))
 $a
-[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
+[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE

 $b
 [1] FALSE FALSE FALSE FALSE FALSE FALSE
 \end{smallverbatim}

-This is due to matching of values in argument \code{unknown} and components
-in a \code{list} i.e. \code{0} is used for component \code{a} and \code{NA}
+This is due to matching of values in the argument \code{unknown} and components
+in a \code{list}; i.e., \code{0} is used for component \code{a} and \code{NA}
 for component \code{b}. Therefore, it is less error prone and more
-flexible to pass \code{list} (preferably named one) to argument
-\code{unknown} as shown bellow.
+flexible to pass a \code{list} (preferably a named list) to the argument
+\code{unknown}, as shown below.
 \begin{smallverbatim}
 > xList1 <- unknownToNA(x=xList,
-+ unknown=list(b=c(0, "NA"), a=0))
++ unknown=list(b=c(0, "NA"), a=0))
 $a
 [1] NA 6 NA 7 8 9 NA

@@ -212,7 +214,7 @@
 \end{smallverbatim}

 Changing \code{NA}s to some other value (only one per component/column) can
-be now something like this:
+be accomplished as follows:

 \begin{smallverbatim}
 > NAToUnknown(x=xList1, unknown=list(b="no", a=0))
@@ -227,11 +229,11 @@
 new level is introduced: no
 \end{smallverbatim}

-Named component \code{.default} of a \code{list} passed to argument
-\code{unknown} has a special meaning as it will match component/column with
+A named component \code{.default} of a \code{list} passed to argument
+\code{unknown} has a special meaning as it will match a component/column with
 that name and any other not defined in \code{unknown}. As such it is very
 useful if the number of components/columns with the same unknown value(s)
-is large. Imagine a wide \code{data.frame} named \code{df}. Now
+is large. Consider a wide \code{data.frame} named \code{df}. Now
 \code{.default} can be used to define unknown value for several columns:

 \begin{smallverbatim}
@@ -242,8 +244,8 @@
 \end{smallverbatim}

 If there is a need to work only on some components/columns you can of
-course ``skip'' columns with standard \R{} mechanisms i.e.
-with subsetting \code{list} or \code{data.frame} objects.
+course ``skip'' columns with standard \R{} mechanisms, i.e.,
+by subsetting \code{list} or \code{data.frame} objects:

 \begin{smallverbatim}
 > cols <- c("col1", "col2")
@@ -255,7 +257,7 @@
 \section{Summary}

 Functions \code{isUnknown}, \code{unknownToNA} and \code{NAToUnknown}
-provide a nice interface to work with various representations of
+provide a useful interface to work with various representations of
 unknown/missing values. Their use is meant primarily for shaping the data
 after importing to or before exporting from \R{}. I welcome any comments or
 suggestions.
@@ -272,8 +274,8 @@
 \bibitem[Gorjanc(2007)]{Gorjanc}
 G.~Gorjanc.
 \newblock Working with unknown values: the gdata package.
-\newblock \emph{R News}, 1\penalty0 (1):\penalty0 ?--?, ? 2007.
-\newblock URL \url{http://CRAN.R-project.org/doc/Rnews/Rnews_2007-?.pdf}.
+\newblock \emph{R News}, 7\penalty0 (1):\penalty0 24--26, 2007.
+\newblock URL \url{http://CRAN.R-project.org/doc/Rnews/Rnews_2007-1.pdf}.

 \bibitem[{R Development Core Team}(2006)]{RImportExportManual}
 {R Development Core Team}.
@@ -281,7 +283,7 @@
 \newblock URL \url{http://cran.r-project.org/manuals.html}.
 \newblock ISBN 3-900051-10-0.

-\bibitem[Warnes.(2006)]{WarnesGdata}
+\bibitem[Warnes (2006)]{WarnesGdata}
 G.~R. Warnes.
 \newblock \emph{gdata: Various R programming tools for data manipulation},
 2006.
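
Note on the vignette text touched by this revision: it states that the test for unknown values is done internally with %in%, and that a named list passed as `unknown` is matched to components by name. The base-R sketch below illustrates only that idea and is not gdata's source. The helpers `is_unknown_sketch` and `unknown_to_na_sketch`, and the small character component `b`, are hypothetical and assumed for illustration.

```r
# Hypothetical helpers for illustration only; they are NOT gdata functions.
# They mimic the vector behaviour described in the vignette, using %in%.
is_unknown_sketch <- function(x, unknown = NA) {
  x %in% unknown
}

unknown_to_na_sketch <- function(x, unknown) {
  x[x %in% unknown] <- NA
  x
}

xNum <- c(0, 6, 0, 7, 8, 9, NA)

res_default <- is_unknown_sketch(xNum)                      # agrees with is.na(xNum)
res_zero    <- is_unknown_sketch(xNum, unknown = 0)         # flags only the zeros
res_both    <- is_unknown_sketch(xNum, unknown = c(0, NA))  # flags zeros and NA
res_na      <- unknown_to_na_sketch(xNum, unknown = 0)      # zeros become NA

# Name-based matching for a list: reorder `unknown` by the names of the data,
# so that each component is tested against its own set of unknown values.
xList   <- list(a = xNum, b = c("0", "BA", "NA"))  # `b` is an assumed example
unknown <- list(b = c(0, "NA"), a = 0)
matched <- Map(function(x, u) x %in% u, xList, unknown[names(xList)])
```

Reordering `unknown` by `names(xList)` is what avoids the positional pitfall shown in the vignette, where `unknown=c(0, NA)` applied `0` to component `a` and `NA` to component `b`.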