[R-gregmisc-users] SF.net SVN: r-gregmisc: [1033] trunk/gdata/inst/doc
From: <gg...@us...> - 2006-11-30 21:32:18
Revision: 1033
          http://svn.sourceforge.net/r-gregmisc/?rev=1033&view=rev
Author:   ggorjan
Date:     2006-11-30 13:32:08 -0800 (Thu, 30 Nov 2006)

Log Message:
-----------
description of unknown methods

Added Paths:
-----------
    trunk/gdata/inst/doc/unknown.pdf
    trunk/gdata/inst/doc/unknown.tex

Added: trunk/gdata/inst/doc/unknown.pdf
===================================================================
(Binary files differ)

Property changes on: trunk/gdata/inst/doc/unknown.pdf
___________________________________________________________________
Name: svn:mime-type
   + application/octet-stream

Added: trunk/gdata/inst/doc/unknown.tex
===================================================================
--- trunk/gdata/inst/doc/unknown.tex (rev 0)
+++ trunk/gdata/inst/doc/unknown.tex 2006-11-30 21:32:08 UTC (rev 1033)
@@ -0,0 +1,301 @@
+\documentclass[a4paper]{report}
+\usepackage{Rnews}
+\usepackage[round]{natbib}
+\bibliographystyle{abbrvnat}
+
+\begin{document}
+
+\begin{article}
+
+\title{Working with unknown values}
+\subtitle{The gdata package}
+\author{by Gregor Gorjanc}
+
+\maketitle
+
+Published as \cite{Gorjanc}.
+
+\section{Introduction}
+
+Unknown or missing values can be represented in various ways. For example,
+SAS uses \code{.} (dot), while \R{} uses \code{NA}, which we can read as
+Not Available. When we import data into \R{}, say via \code{read.table} or
+its derivatives, conversion of blank fields to \code{NA} (according to
+\code{read.table} help) is done for \code{logical}, \code{integer},
+\code{numeric} and \code{complex} classes. Additionally, the \code{na.strings}
+argument can be used to specify values that should also be converted to
+\code{NA}. Conversely, there is an argument \code{na} in \code{write.table}
+and its derivatives to define the value that will replace \code{NA} in exported
+data. There are also other ways to import/export data into \R{} as
+described in the ``R Data Import/Export'' manual \citep{RImportExportManual};
+however, all approaches lack the possibility to define unknown value(s) for
+a particular column. It is possible that an unknown value in one column is a
+valid value in another column. For example, I have seen many datasets where
+values such as 0, -9, 999 and specific dates are used as column-specific unknown
+values in one dataset.
+
+This note presents a set of functions in the \pkg{gdata}\footnote{from version
+  2.3.1} package \citep{WarnesGdata}: \code{isUnknown}, \code{unknownToNA}
+and \code{NAToUnknown}, which can help with testing for unknown values and
+conversions between unknown values and \code{NA}. All three functions are
+generic (S3) and were tested (at the time of writing) to work with:
+\code{integer}, \code{numeric}, \code{character}, \code{factor},
+\code{Date}, \code{POSIXct}, \code{POSIXlt}, \code{list}, \code{data.frame}
+and \code{matrix} classes.
+
+\section{Description with examples}
+
+The following examples show simple usage of these functions on
+\code{numeric} and \code{factor} classes, where value \code{0} (besides
+\code{NA}) should be treated as an unknown value:
+
+\begin{smallverbatim}
+> library(gdata)
+> xNum <- c(0, 6, 0, 7, 8, 9, NA)
+> isUnknown(x=xNum)
+[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
+\end{smallverbatim}
+
+The default unknown value in \code{isUnknown} is \code{NA}, which means
+that the output is the same as for \code{is.na} - at least for atomic
+classes. However, we can pass argument \code{unknown} to define which
+values should be treated as unknown.
+
+\begin{smallverbatim}
+> isUnknown(x=xNum, unknown=0)
+[1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
+\end{smallverbatim}
+
+This skipped \code{NA}, but we can get the expected answer after appropriately
+adding \code{NA} into argument \code{unknown}.
+
+\begin{smallverbatim}
+> isUnknown(x=xNum, unknown=c(0, NA))
+[1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
+\end{smallverbatim}
+
+Now, we can change all unknown values to \code{NA} with \code{unknownToNA}.
+There is clearly no need to add \code{NA} here. This step is very handy
+after importing data from an external source, where many different unknown
+values might be used. Argument \code{warning=TRUE} can be used if there is
+a need to be warned about ``original'' \code{NA}s.
+
+\begin{smallverbatim}
+> xNum2 <- unknownToNA(x=xNum, unknown=0)
+[1] NA  6 NA  7  8  9 NA
+\end{smallverbatim}
+
+Prior to export from \R{}, we might want to change the unknown value (\code{NA}
+in \R{}) to some other value. Function \code{NAToUnknown} can be used for
+this.
+
+\begin{smallverbatim}
+> NAToUnknown(x=xNum2, unknown=999)
+[1] 999   6 999   7   8   9 999
+\end{smallverbatim}
+
+Converting \code{NA} to a value that already exists in \code{x} issues an
+error; however, \code{force=TRUE} can be used to overcome this if
+needed. But be warned that there is no way back from this step.
+
+\begin{smallverbatim}
+> NAToUnknown(x=xNum2, unknown=7, force=TRUE)
+[1] 7 6 7 7 8 9 7
+\end{smallverbatim}
+
+The examples below show all peculiarities with the \code{factor} class.
+\code{unknownToNA} removes the \code{unknown} value from levels and, conversely,
+\code{NAToUnknown} adds it with a warning. Additionally, \code{"NA"} is
+properly distinguished from \code{NA}. It can also be seen that argument
+\code{unknown} in functions \code{isUnknown} and \code{unknownToNA} need
+not match the class of \code{x} (otherwise factor should be used) as the test
+is internally done with \code{\%in\%}, which nicely resolves coercion
+issues.
+
+\begin{smallverbatim}
+> xFac <- factor(c(0, "BA", "RA", "BA", NA, "NA"))
+[1] 0    BA   RA   BA   <NA> NA
+Levels: 0 BA NA RA
+> isUnknown(x=xFac)
+[1] FALSE FALSE FALSE FALSE  TRUE FALSE
+> isUnknown(x=xFac, unknown=0)
+[1]  TRUE FALSE FALSE FALSE FALSE FALSE
+> isUnknown(x=xFac, unknown=c(0, NA))
+[1]  TRUE FALSE FALSE FALSE  TRUE FALSE
+> isUnknown(x=xFac, unknown=c(0, "NA"))
+[1]  TRUE FALSE FALSE FALSE FALSE  TRUE
+> isUnknown(x=xFac, unknown=c(0, "NA", NA))
+[1]  TRUE FALSE FALSE FALSE  TRUE  TRUE
+
+> xFac <- unknownToNA(x=xFac, unknown=0)
+[1] <NA> BA   RA   BA   <NA> NA
+Levels: BA NA RA
+> xFac <- NAToUnknown(x=xFac, unknown=0)
+[1] 0  BA RA BA 0  NA
+Levels: 0 BA NA RA
+Warning message:
+new level is introduced: 0
+\end{smallverbatim}
+
+These two examples with numeric and factor classes are fairly simple and we
+could get the same results with one or two lines of \R{} code. The real
+benefit of the presented set of functions is in the \code{list} and
+\code{data.frame} methods, where \code{data.frame} methods are merely
+wrappers for \code{list} methods.
+
+We need additional flexibility for \code{list}/\code{data.frame} methods,
+due to the possibility of having multiple unknown values that can be different
+among \code{list} components or \code{data.frame} columns. For these two
+methods, argument \code{unknown} can be either a \code{vector} or
+\code{list}, both possibly named. Of course, greater flexibility (defining
+multiple unknown values per component/column) can be achieved with
+a \code{list}.
+
+When the \code{vector}/\code{list} passed to argument \code{unknown} is not
+named, the first value/component of the \code{vector}/\code{list} matches the first
+component/column of the \code{list}/\code{data.frame}. This can be quite
+error prone, especially with \code{vectors}. Therefore, I encourage use of
+a \code{list}. In case the \code{vector}/\code{list} passed to argument
+\code{unknown} is named, names are matched to names of the \code{list} or
+\code{data.frame}. If lengths of \code{unknown} and \code{list} or
+\code{data.frame} do not match, recycling occurs.
+
+The example below shows usage of the described functions on a list that is
+composed of the previously defined and modified numeric (\code{xNum}) and
+factor (\code{xFac}) classes. First, function \code{isUnknown} is used with
+\code{0} as the unknown value. Note that we get \code{FALSE} for \code{NA}s as
+has been the case in the first example.
+
+\begin{smallverbatim}
+> xList <- list(a=xNum, b=xFac)
+$a
+[1]  0  6  0  7  8  9 NA
+
+$b
+[1] 0  BA RA BA 0  NA
+Levels: 0 BA NA RA
+> isUnknown(x=xList, unknown=0)
+$a
+[1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
+
+$b
+[1]  TRUE FALSE FALSE FALSE  TRUE FALSE
+\end{smallverbatim}
+
+We need to add \code{NA} as an unknown value. However, we do not get the
+expected result this way!
+
+\begin{smallverbatim}
+> isUnknown(x=xList, unknown=c(0, NA))
+$a
+[1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
+
+$b
+[1] FALSE FALSE FALSE FALSE FALSE FALSE
+\end{smallverbatim}
+
+This is due to matching of values in argument \code{unknown} and components
+in a \code{list}, i.e. \code{0} is used for component \code{a} and \code{NA}
+for component \code{b}. Therefore, it is less error prone and more
+flexible to pass a \code{list} (preferably a named one) to argument
+\code{unknown} as shown below.
+
+\begin{smallverbatim}
+> xList1 <- unknownToNA(x=xList,
++                       unknown=list(b=c(0, "NA"), a=0))
+$a
+[1] NA  6 NA  7  8  9 NA
+
+$b
+[1] <NA> BA   RA   BA   <NA> <NA>
+Levels: BA RA
+\end{smallverbatim}
+
+Changing \code{NA}s to some other value (only one per component/column) can
+now be done like this:
+
+\begin{smallverbatim}
+> NAToUnknown(x=xList1, unknown=list(b="no", a=0))
+$a
+[1] 0 6 0 7 8 9 0
+
+$b
+[1] no BA RA BA no no
+Levels: BA no RA
+
+Warning message:
+new level is introduced: no
+\end{smallverbatim}
+
+Named component \code{.default} of a \code{list} passed to argument
+\code{unknown} has a special meaning as it will match a component/column with
+that name and any other not defined in \code{unknown}. As such it is very
+useful if the number of components/columns with the same unknown value(s)
+is large. Imagine a wide \code{data.frame} named \code{df}. Now
+\code{.default} can be used to define the unknown value for several columns:
+
+\begin{smallverbatim}
+> df <- unknownToNA(x=df,
++                   unknown=list(.default=0,
++                                col1=999,
++                                col2="unknown"))
+\end{smallverbatim}
+
+If there is a need to work only on some components/columns you can of
+course ``skip'' columns with standard \R{} mechanisms, i.e.
+by subsetting \code{list} or \code{data.frame} objects.
+
+\begin{smallverbatim}
+> cols <- c("col1", "col2")
+> df[, cols] <- unknownToNA(x=df[, cols],
++                           unknown=list(col1=999,
++                                        col2="unknown"))
+\end{smallverbatim}
+
+\section{Summary}
+
+Functions \code{isUnknown}, \code{unknownToNA} and \code{NAToUnknown}
+provide a nice interface to work with various representations of
+unknown/missing values. Their use is meant primarily for shaping the data
+after importing to or before exporting from \R{}. I welcome any comments or
+suggestions.
+
+% \bibliography{refs}
+
+\begin{thebibliography}{1}
+\providecommand{\natexlab}[1]{#1}
+\providecommand{\url}[1]{\texttt{#1}}
+\expandafter\ifx\csname urlstyle\endcsname\relax
+  \providecommand{\doi}[1]{doi: #1}\else
+  \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi
+
+\bibitem[Gorjanc(2007)]{Gorjanc}
+G.~Gorjanc.
+\newblock Working with unknown values: the gdata package.
+\newblock \emph{R News}, 1\penalty0 (1):\penalty0 ?--?, ? 2007.
+\newblock URL \url{http://CRAN.R-project.org/doc/Rnews/Rnews_2007-?.pdf}.
+
+\bibitem[{R Development Core Team}(2006)]{RImportExportManual}
+{R Development Core Team}.
+\newblock \emph{R Data Import/Export}, 2006.
+\newblock URL \url{http://cran.r-project.org/manuals.html}.
+\newblock ISBN 3-900051-10-0.
+
+\bibitem[Warnes.(2006)]{WarnesGdata}
+G.~R. Warnes.
+\newblock \emph{gdata: Various R programming tools for data manipulation},
+  2006.
+\newblock URL
+  \url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}.
+\newblock R package version 2.3.1. Includes R source code and/or documentation
+  contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley.
+
+\end{thebibliography}
+
+\address{Gregor Gorjanc\\
+  University of Ljubljana, Slovenia\\
+\email{gre...@bf...}}
+
+\end{article}
+
+\end{document}