[R-gregmisc-users] SF.net SVN: r-gregmisc: [1033] trunk/gdata/inst/doc


Revision: 1033
          http://svn.sourceforge.net/r-gregmisc/?rev=1033&view=rev
Author:   ggorjan
Date:     2006-11-30 13:32:08 -0800 (Thu, 30 Nov 2006)

Log Message:
-----------
description of unknown methods

Added Paths:
-----------
    trunk/gdata/inst/doc/unknown.pdf
    trunk/gdata/inst/doc/unknown.tex

Added: trunk/gdata/inst/doc/unknown.pdf
===================================================================
(Binary files differ)


Property changes on: trunk/gdata/inst/doc/unknown.pdf
___________________________________________________________________
Name: svn:mime-type
   + application/octet-stream

Added: trunk/gdata/inst/doc/unknown.tex
===================================================================

--- trunk/gdata/inst/doc/unknown.tex	                        (rev 0)
+++ trunk/gdata/inst/doc/unknown.tex	2006-11-30 21:32:08 UTC (rev 1033)
@@ -0,0 +1,301 @@
+\documentclass[a4paper]{report}
+\usepackage{Rnews}
+\usepackage[round]{natbib}
+\bibliographystyle{abbrvnat}
+
+\begin{document}
+
+\begin{article}
+
+\title{Working with unknown values}
+\subtitle{The gdata package}
+\author{by Gregor Gorjanc}
+
+\maketitle
+
+Published as \cite{Gorjanc}.
+
+\section{Introduction}
+
+Unknown or missing values can be represented in various ways. For example,
+SAS uses \code{.} (dot), while \R{} uses \code{NA}, which we can read as
+Not Available. When we import data into \R{}, say via \code{read.table} or
+its derivatives, conversion of blank fields to \code{NA} (according to
+\code{read.table} help) is done for \code{logical}, \code{integer},
+\code{numeric} and \code{complex} classes. Additionally, the
+\code{na.strings} argument can be used to specify values that should also
+be converted to \code{NA}. Conversely, there is an argument \code{na} in
+\code{write.table} and its derivatives to define the value that will
+replace \code{NA} in exported data. There are also other ways to
+import/export data into \R{} as described in the ``R Data Import/Export''
+manual \citep{RImportExportManual}; however, all approaches lack the
+possibility to define unknown value(s) for a particular column. It is
+possible that an unknown value in one column is a valid value in another
+column. For example, I have seen many datasets where values such as 0, -9,
+999 and specific dates are used as column specific unknown
+values in one dataset.
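+
+As a small illustration with made up data (the object names \code{txt} and
+\code{dat} are chosen here only for this sketch, and argument \code{text}
+of \code{read.table} is assumed to be available in the \R{} version at
+hand), assume that \code{-9} marks an unknown \code{age}, while it is a
+valid \code{temp} value. Because \code{na.strings} applies to all columns,
+the valid value is converted as well:
+
+\begin{smallverbatim}
+> txt <- c("age temp", "25 -9", "-9 3", "40 12")
+> dat <- read.table(text=txt, header=TRUE,
++                   na.strings="-9")
+> dat
+  age temp
+1  25   NA
+2  NA    3
+3  40   12
+\end{smallverbatim}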
+
+This note presents a set of functions in the \pkg{gdata}\footnote{from version
+  2.3.1} package \citep{WarnesGdata}: \code{isUnknown}, \code{unknownToNA}
+and \code{NAToUnknown}, which can help with testing for unknown values and
+conversions between unknown values and \code{NA}. All three functions are
+generic (S3) and were tested (at the time of writing) to work with:
+\code{integer}, \code{numeric}, \code{character}, \code{factor},
+\code{Date}, \code{POSIXct}, \code{POSIXlt}, \code{list}, \code{data.frame}
+and \code{matrix} classes.
+
+\section{Description with examples}
+
+The following examples show simple usage of these functions on
+\code{numeric} and \code{factor} classes, where value \code{0} (besides
+\code{NA}) should be treated as an unknown value:
+
+\begin{smallverbatim}
+> library(gdata)
+> xNum <- c(0, 6, 0, 7, 8, 9, NA)
+> isUnknown(x=xNum)
+[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
+\end{smallverbatim}
+
+The default unknown value in \code{isUnknown} is \code{NA}, which means
+that the output is the same as that of \code{is.na}, at least for atomic
+classes. However, we can pass argument \code{unknown} to define which
+values should be treated as unknown.
+
+\begin{smallverbatim}
+> isUnknown(x=xNum, unknown=0)
+[1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
+\end{smallverbatim}
+
+This skipped \code{NA}, but we can get the expected answer after adding
+\code{NA} to argument \code{unknown}.
+
+\begin{smallverbatim}
+> isUnknown(x=xNum, unknown=c(0, NA))
+[1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
+\end{smallverbatim}
+
+Now, we can change all unknown values to \code{NA} with \code{unknownToNA}.
+There is clearly no need to add \code{NA} here. This step is very handy
+after importing data from an external source, where many different unknown
+values might be used. Argument \code{warning=TRUE} can be used if there is
+a need to be warned about ``original'' \code{NA}s.
+
+\begin{smallverbatim}
+> (xNum2 <- unknownToNA(x=xNum, unknown=0))
+[1] NA  6 NA  7  8  9 NA
+\end{smallverbatim}
+
+Prior to export from \R{}, we might want to change the unknown value
+(\code{NA} in \R{}) to some other value. Function \code{NAToUnknown} can be
+used for this.
+
+\begin{smallverbatim}
+> NAToUnknown(x=xNum2, unknown=999)
+[1] 999   6 999   7   8   9 999
+\end{smallverbatim}
+
+Converting \code{NA} to a value that already exists in \code{x} issues an
+error; however, \code{force=TRUE} can be used to overcome this if
+needed. But be warned that there is no way back from this step.
+
+\begin{smallverbatim}
+> NAToUnknown(x=xNum2, unknown=7, force=TRUE)
+[1] 7 6 7 7 8 9 7
+\end{smallverbatim}
+
+The examples below show the peculiarities of the \code{factor} class.
+\code{unknownToNA} removes the \code{unknown} value from levels and
+conversely \code{NAToUnknown} adds it with a warning. Additionally,
+\code{"NA"} is properly distinguished from \code{NA}. It can also be seen
+that argument \code{unknown} in functions \code{isUnknown} and
+\code{unknownToNA} need not match the class of \code{x} (otherwise a factor
+would have to be used), as the test is internally done with
+\code{\%in\%}, which nicely resolves coercion issues.
+
+\begin{smallverbatim}
+> (xFac <- factor(c(0, "BA", "RA", "BA", NA, "NA")))
+[1] 0    BA   RA   BA   <NA> NA
+Levels: 0 BA NA RA
+> isUnknown(x=xFac)
+[1] FALSE FALSE FALSE FALSE  TRUE FALSE
+> isUnknown(x=xFac, unknown=0)
+[1]  TRUE FALSE FALSE FALSE FALSE FALSE
+> isUnknown(x=xFac, unknown=c(0, NA))
+[1]  TRUE FALSE FALSE FALSE  TRUE FALSE
+> isUnknown(x=xFac, unknown=c(0, "NA"))
+[1]  TRUE FALSE FALSE FALSE FALSE  TRUE
+> isUnknown(x=xFac, unknown=c(0, "NA", NA))
+[1]  TRUE FALSE FALSE FALSE  TRUE  TRUE
+
+> (xFac <- unknownToNA(x=xFac, unknown=0))
+[1] <NA> BA   RA   BA   <NA> NA
+Levels: BA NA RA
+> (xFac <- NAToUnknown(x=xFac, unknown=0))
+[1] 0  BA RA BA 0  NA
+Levels: 0 BA NA RA
+Warning message:
+new level is introduced: 0
+\end{smallverbatim}
+
+These two examples with numeric and factor classes are fairly simple and we
+could get the same results with one or two lines of \R{} code. The real
+benefit of the presented set of functions is in \code{list} and
+\code{data.frame} methods, where \code{data.frame} methods are merely
+wrappers for \code{list} methods.
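+
+For comparison, a minimal base \R{} sketch of such ``one or two lines''
+could look as follows. This is only an illustration and not the code used
+in \pkg{gdata}; \code{xTmp} is a name chosen here for the copy:
+
+\begin{smallverbatim}
+> xNum <- c(0, 6, 0, 7, 8, 9, NA)
+> xNum %in% c(0, NA)
+[1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
+> xTmp <- xNum
+> xTmp[xTmp %in% 0] <- NA
+> xTmp
+[1] NA  6 NA  7  8  9 NA
+> factor(c("0", "BA", NA)) %in% c(0, NA)
+[1]  TRUE FALSE  TRUE
+\end{smallverbatim}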
+
+We need additional flexibility for \code{list}/\code{data.frame} methods,
+due to the possibility of having multiple unknown values that can differ
+among \code{list} components or \code{data.frame} columns. For these two
+methods, argument \code{unknown} can be either a \code{vector} or a
+\code{list}, both possibly named. Of course, greater flexibility (defining
+multiple unknown values per component/column) can be achieved with
+a \code{list}.
+
+When the \code{vector}/\code{list} passed to argument \code{unknown} is not
+named, the first value/component of the \code{vector}/\code{list} matches
+the first component/column of the \code{list}/\code{data.frame}. This can
+be quite error prone, especially with a \code{vector}. Therefore, I
+encourage the use of a \code{list}. If the \code{vector}/\code{list} passed
+to argument \code{unknown} is named, names are matched to names of the
+\code{list} or \code{data.frame}. If lengths of \code{unknown} and the
+\code{list} or \code{data.frame} do not match, recycling occurs.
+
+The example below shows the use of the described functions on a list that
+is composed of the previously defined and modified numeric (\code{xNum})
+and factor (\code{xFac}) objects. First, function \code{isUnknown} is used
+with \code{0} as the unknown value. Note that we get \code{FALSE} for
+\code{NA}s, as was the case in the first example.
+
+\begin{smallverbatim}
+> (xList <- list(a=xNum, b=xFac))
+$a
+[1]  0  6  0  7  8  9 NA
+
+$b
+[1] 0  BA RA BA 0  NA
+Levels: 0 BA NA RA
+> isUnknown(x=xList, unknown=0)
+$a
+[1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
+
+$b
+[1]  TRUE FALSE FALSE FALSE  TRUE FALSE
+\end{smallverbatim}
+
+We need to add \code{NA} as an unknown value. However, we do not get the
+expected result this way!
+
+\begin{smallverbatim}
+> isUnknown(x=xList, unknown=c(0, NA))
+$a
+[1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
+
+$b
+[1] FALSE FALSE FALSE FALSE FALSE FALSE
+\end{smallverbatim}
+
+This is due to matching of values in argument \code{unknown} and components
+in a \code{list}, i.e., \code{0} is used for component \code{a} and
+\code{NA} for component \code{b}. Therefore, it is less error prone and
+more flexible to pass a \code{list} (preferably a named one) to argument
+\code{unknown}, as shown below.
+
+\begin{smallverbatim}
+> (xList1 <- unknownToNA(x=xList,
++              unknown=list(b=c(0, "NA"), a=0)))
+$a
+[1] NA  6 NA  7  8  9 NA
+
+$b
+[1] <NA> BA   RA   BA   <NA> <NA>
+Levels: BA RA
+\end{smallverbatim}
+
+Changing \code{NA}s to some other value (only one per component/column) can
+now be done like this:
+
+\begin{smallverbatim}
+> NAToUnknown(x=xList1, unknown=list(b="no", a=0))
+$a
+[1] 0 6 0 7 8 9 0
+
+$b
+[1] no BA RA BA no no
+Levels: BA no RA
+
+Warning message:
+new level is introduced: no
+\end{smallverbatim}
+
+A component named \code{.default} in the \code{list} passed to argument
+\code{unknown} has a special meaning, as it will match the component/column
+with that name and any other not defined in \code{unknown}. As such it is
+very useful if the number of components/columns with the same unknown
+value(s) is large. Imagine a wide \code{data.frame} named \code{df}. Now
+\code{.default} can be used to define the unknown value for several columns:
+
+\begin{smallverbatim}
+> df <- unknownToNA(x=df,
++                   unknown=list(.default=0,
++                                col1=999,
++                                col2="unknown"))
+\end{smallverbatim}
+
+If there is a need to work only on some components/columns, you can of
+course ``skip'' the others with standard \R{} mechanisms, i.e.,
+by subsetting \code{list} or \code{data.frame} objects.
+
+\begin{smallverbatim}
+> cols <- c("col1", "col2")
+> df[, cols] <- unknownToNA(x=df[, cols],
++                           unknown=list(col1=999,
++                                        col2="unknown"))
+\end{smallverbatim}
+
+\section{Summary}
+
+Functions \code{isUnknown}, \code{unknownToNA} and \code{NAToUnknown}
+provide a nice interface to work with various representations of
+unknown/missing values. Their use is meant primarily for shaping the data
+after importing to or before exporting from \R{}. I welcome any comments or
+suggestions.
+
+% \bibliography{refs}
+
+\begin{thebibliography}{1}
+\providecommand{\natexlab}[1]{#1}
+\providecommand{\url}[1]{\texttt{#1}}
+\expandafter\ifx\csname urlstyle\endcsname\relax
+  \providecommand{\doi}[1]{doi: #1}\else
+  \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi
+
+\bibitem[Gorjanc(2007)]{Gorjanc}
+G.~Gorjanc.
+\newblock Working with unknown values: the gdata package.
+\newblock \emph{R News}, 1\penalty0 (1):\penalty0 ?--?, ? 2007.
+\newblock URL \url{http://CRAN.R-project.org/doc/Rnews/Rnews_2007-?.pdf}.
+
+\bibitem[{R Development Core Team}(2006)]{RImportExportManual}
+{R Development Core Team}.
+\newblock \emph{R Data Import/Export}, 2006.
+\newblock URL \url{http://cran.r-project.org/manuals.html}.
+\newblock ISBN 3-900051-10-0.
+
+\bibitem[Warnes.(2006)]{WarnesGdata}
+G.~R. Warnes.
+\newblock \emph{gdata: Various R programming tools for data manipulation},
+  2006.
+\newblock URL
+  \url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}.
+\newblock R package version 2.3.1. Includes R source code and/or documentation
+  contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley.
+
+\end{thebibliography}
+
+\address{Gregor Gorjanc\\
+  University of Ljubljana, Slovenia\\
+\email{gre...@bf...}}
+
+\end{article}
+
+\end{document}

