[R-gregmisc-users] SF.net SVN: r-gregmisc:[1789] trunk/gdata

Revision: 1789
          http://sourceforge.net/p/r-gregmisc/code/1789
Author:   warnes
Date:     2014-04-05 14:26:49 +0000 (Sat, 05 Apr 2014)
Log Message:
-----------
Move vignettes from inst/doc/ to vignettes/

Added Paths:
-----------
    trunk/gdata/vignettes/
    trunk/gdata/vignettes/mapLevels.Rnw
    trunk/gdata/vignettes/unknown.Rnw

Removed Paths:
-------------
    trunk/gdata/inst/doc/mapLevels.Rnw
    trunk/gdata/inst/doc/unknown.Rnw

Deleted: trunk/gdata/inst/doc/mapLevels.Rnw
===================================================================

--- trunk/gdata/inst/doc/mapLevels.Rnw	2014-04-05 13:57:10 UTC (rev 1788)
+++ trunk/gdata/inst/doc/mapLevels.Rnw	2014-04-05 14:26:49 UTC (rev 1789)
@@ -1,229 +0,0 @@
-
-%\VignetteIndexEntry{Mapping levels of a factor}
-%\VignettePackage{gdata}
-%\VignetteKeywords{levels, factor, manip}
-
-\documentclass[a4paper]{report}
-\usepackage{Rnews}
-\usepackage[round]{natbib}
-\bibliographystyle{abbrvnat}
-
-\usepackage{Sweave}
-\SweaveOpts{strip.white=all, keep.source=TRUE}
-
-\begin{document}
-
-\begin{article}
-
-\title{Mapping levels of a factor}
-\subtitle{The \pkg{gdata} package}
-\author{by Gregor Gorjanc}
-
-\maketitle
-
-\section{Introduction}
-
-Factors use the levels attribute to store the mapping between internal
-integer codes and character values, i.e. levels. The first level is
-mapped to internal integer code 1 and so on. Although some users do not
-like factors, they are more efficient in terms of storage than
-character vectors. Additionally, there are many functions in base \R{} that
-provide additional value for factors. Sometimes users need to work with
-internal integer codes and map them back to a factor, especially when
-interfacing with external programs. Mapping information is also of interest if
-there are many factors that should have the same set of levels. This note
-describes the \code{mapLevels} function, which is a utility function for
-mapping the levels of a factor in the \pkg{gdata}\footnote{from version 2.3.1}
-package \citep{WarnesGdata}.
-
-\section{Description with examples}
-
-Function \code{mapLevels()} is an (S3) generic function and works on
-\code{factor} and \code{character} atomic classes. It also works on
-\code{list} and \code{data.frame} objects with previously mentioned atomic
-classes. Function \code{mapLevels} produces a so called ``map'' with names
-and values. Names are levels, while values can be internal integer codes or
-(possibly other) levels. This will be clarified later on.  Class of this
-``map'' is \code{levelsMap}, if \code{x} in \code{mapLevels()} was atomic
-or \code{listLevelsMap} otherwise - for \code{list} and \code{data.frame}
-classes. The following example shows the creation and printout of such a
-``map''.
-
-<<ex01>>=
-library(gdata)
-(fac <- factor(c("B", "A", "Z", "D")))
-(map <- mapLevels(x=fac))
-@
-
-If we have to work with internal integer codes, we can transform a factor to
-integer and still get ``back the original factor'' with the ``map'' used as
-an argument in the \code{mapLevels<-} function, as shown below. \code{mapLevels<-}
-is also an (S3) generic function and works on the same classes as
-\code{mapLevels} plus the \code{integer} atomic class.
-
-<<ex02>>=
-(int <- as.integer(fac))
-mapLevels(x=int) <- map
-int
-identical(fac, int)
-@
-
-Internally ``map'' (\code{levelsMap} class) is a \code{list} (see below),
-but its print method unlists it for ease of inspection. The ``map'' from the
-example has all components of length 1. This is not mandatory as the
-\code{mapLevels<-} function is only a wrapper around the workhorse function
-\code{levels<-} and the latter can accept a \code{list} with components of
-various lengths.
-
-<<ex03>>=
-str(map)
-@
-
-Although not of primary importance, this ``map'' can also be used to remap
-factor levels as shown below.  Components ``later'' in the map take over
-the ``previous'' ones. Since this is not optimal I would rather recommend
-other approaches for ``remapping'' the levels of a \code{factor}, say
-\code{recode} in the \pkg{car} package \citep{FoxCar}.
-
-<<ex04>>=
-map[[2]] <- as.integer(c(1, 2))
-map
-int <- as.integer(fac)
-mapLevels(x=int) <- map
-int
-@
-
-Up to now the examples showed a ``map'' with internal integer codes for values
-and levels for names. I call this an integer ``map''. On the other hand, a
-character ``map'' uses levels for values and (possibly other) levels for
-names. This feature is a bit odd at first sight, but can be used to easily
-unify levels and internal integer codes across several factors.  Imagine
-you have a factor that is for some reason split into two factors \code{f1}
-and \code{f2} and that each factor does not have all levels. This is not
-an uncommon situation.
-
-<<ex05>>=
-(f1 <- factor(c("A", "D", "C")))
-(f2 <- factor(c("B", "D", "C")))
-@
-
-If we work with these factors, we need to be careful as they do not have the
-same set of levels. This can be solved by appropriately specifying the
-\code{levels} argument when creating the factors, i.e. \code{levels=c("A", "B",
-  "C", "D")}, or with proper use of the \code{levels<-} function. I say proper
-as it is very tempting to use:
-
-<<ex06>>=
-fTest <- f1
-levels(fTest) <- c("A", "B", "C", "D")
-fTest
-@
-
-The above example extends the set of levels, but also changes the level of the
-2nd and 3rd element in \code{fTest}! Proper use of \code{levels<-} (as shown in
-the \code{levels} help page) would be:
-
-<<ex07>>=
-fTest <- f1
-levels(fTest) <- list(A="A", B="B",
-                      C="C", D="D")
-fTest
-@
-
-The function \code{mapLevels} with a character ``map'' can help us in such
-scenarios to unify levels and internal integer codes across several
-factors. Again the workhorse under this process is the \code{levels<-} function
-from base \R{}! The function \code{mapLevels<-} just controls the assignment of an
-(integer or character) ``map'' to \code{x}. Levels in \code{x} that match
-``map'' values (internal integer codes or levels) are changed to ``map''
-names (possibly other levels) as shown in the \code{levels} help page. Levels
-that do not match are converted to \code{NA}. An integer ``map'' can be
-applied to \code{integer} or \code{factor}, while a character ``map'' can be
-applied to \code{character} or \code{factor}. The result of \code{mapLevels<-}
-is always a \code{factor} with possibly ``remapped'' levels.
-
-To get one joint character ``map'' for several factors, we need to put the
-factors in a \code{list} or \code{data.frame} and use the arguments
-\code{codes=FALSE} and \code{combine=TRUE}. Such a map can then be used to
-unify levels and internal integer codes.
-
-<<ex08>>=
-(bigMap <- mapLevels(x=list(f1, f2),
-                     codes=FALSE,
-                     combine=TRUE))
-mapLevels(f1) <- bigMap
-mapLevels(f2) <- bigMap
-f1
-f2
-cbind(as.character(f1), as.integer(f1),
-      as.character(f2), as.integer(f2))
-@
-
-If we do not specify \code{combine=TRUE} (leaving it out is the default behaviour)
-and \code{x} is a \code{list} or \code{data.frame}, \code{mapLevels}
-returns a ``map'' of class \code{listLevelsMap}. This is internally a
-\code{list} of ``maps'' (\code{levelsMap} objects). Both
-\code{listLevelsMap} and \code{levelsMap} objects can be passed to
-\code{mapLevels<-} for \code{list}/\code{data.frame}. Recycling occurs when the
-length of the \code{listLevelsMap} is not the same as the number of
-components/columns of the \code{list}/\code{data.frame}.
-
-Additional convenience methods are also implemented to ease the work with
-``maps'':
-
-\begin{itemize}
-
-\item \code{is.levelsMap}, \code{is.listLevelsMap}, \code{as.levelsMap} and
-  \code{as.listLevelsMap} for testing and coercion of user defined
-  ``maps'',
-
-\item \code{"["} for subsetting,
-
-\item \code{c} for combining \code{levelsMap} or \code{listLevelsMap}
-  objects; argument \code{recursive=TRUE} can be used to coerce
-  \code{listLevelsMap} to \code{levelsMap}, for example \code{c(llm1, llm2,
-    recursive=TRUE)} and
-
-\item \code{unique} and \code{sort} for \code{levelsMap}.
-
-\end{itemize}
-
-\section{Summary}
-
-Functions \code{mapLevels} and \code{mapLevels<-} can help users to map
-internal integer codes to factor levels and unify levels as well as
-internal integer codes among several factors. I welcome any comments or
-suggestions.
-
-% \bibliography{refs}
-\begin{thebibliography}{1}
-\providecommand{\natexlab}[1]{#1}
-\providecommand{\url}[1]{\texttt{#1}}
-\expandafter\ifx\csname urlstyle\endcsname\relax
-  \providecommand{\doi}[1]{doi: #1}\else
-  \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi
-
-\bibitem[Fox(2006)]{FoxCar}
-J.~Fox.
-\newblock \emph{car: Companion to Applied Regression}, 2006.
-\newblock URL \url{http://socserv.socsci.mcmaster.ca/jfox/}.
-\newblock R package version 1.1-1.
-
-\bibitem[Warnes(2006)]{WarnesGdata}
-G.~R. Warnes.
-\newblock \emph{gdata: Various R programming tools for data manipulation},
-  2006.
-\newblock URL
-  \url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}.
-\newblock R package version 2.3.1. Includes R source code and/or documentation
-  contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley.
-
-\end{thebibliography}
-
-\address{Gregor Gorjanc\\
-  University of Ljubljana, Slovenia\\
-\email{gre...@bf...}}
-
-\end{article}
-
-\end{document}
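
The level-unification problem described in the mapLevels vignette above (positional renaming with levels<- versus a shared set of levels) can be sketched in base R alone. This is only a minimal illustration under stated assumptions, not the gdata implementation: the helper name unifyLevels is hypothetical and does not exist in gdata.

```r
# Two factors that each miss some levels (as in the vignette's ex05).
f1 <- factor(c("A", "D", "C"))
f2 <- factor(c("B", "D", "C"))

# Tempting but wrong: a character vector renames levels by position,
# so the old level "C" becomes "B" and the old level "D" becomes "C".
fWrong <- f1
levels(fWrong) <- c("A", "B", "C", "D")

# Hypothetical helper (not part of gdata): rebuild every factor on the
# sorted union of all levels, so labels and integer codes agree.
unifyLevels <- function(...) {
  facs <- list(...)
  allLevels <- sort(unique(unlist(lapply(facs, levels))))
  lapply(facs, function(f) factor(as.character(f), levels = allLevels))
}

unified <- unifyLevels(f1, f2)
g1 <- unified[[1]]
g2 <- unified[[2]]
```

After this, g1 and g2 keep their original values but share one set of levels, so equal integer codes mean equal labels in both factors.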

Deleted: trunk/gdata/inst/doc/unknown.Rnw
===================================================================
--- trunk/gdata/inst/doc/unknown.Rnw	2014-04-05 13:57:10 UTC (rev 1788)
+++ trunk/gdata/inst/doc/unknown.Rnw	2014-04-05 14:26:49 UTC (rev 1789)
@@ -1,272 +0,0 @@
-
-%\VignetteIndexEntry{Working with Unknown Values}
-%\VignettePackage{gdata}
-%\VignetteKeywords{unknown, missing, manip}
-
-\documentclass[a4paper]{report}
-\usepackage{Rnews}
-\usepackage[round]{natbib}
-\bibliographystyle{abbrvnat}
-
-\usepackage{Sweave}
-\SweaveOpts{strip.white=all, keep.source=TRUE}
-
-\begin{document}
-
-\begin{article}
-
-\title{Working with Unknown Values}
-\subtitle{The \pkg{gdata} package}
-\author{by Gregor Gorjanc}
-
-\maketitle
-
-This vignette has been published as \cite{Gorjanc}.
-
-\section{Introduction}
-
-Unknown or missing values can be represented in various ways. For example
-SAS uses \code{.}~(dot), while \R{} uses \code{NA}, which we can read as
-Not Available. When we import data into \R{}, say via \code{read.table} or
-its derivatives, conversion of blank fields to \code{NA} (according to
-\code{read.table} help) is done for \code{logical}, \code{integer},
-\code{numeric} and \code{complex} classes. Additionally, the
-\code{na.strings} argument can be used to specify values that should also
-be converted to \code{NA}. Inversely, there is an argument \code{na} in
-\code{write.table} and its derivatives to define the value that will replace
-\code{NA} in exported data. There are also other ways to import/export data
-into \R{} as described in the \emph{R Data Import/Export} manual
-\citep{RImportExportManual}.  However, all approaches lack the possibility
-to define unknown value(s) for some particular column. It is possible that
-an unknown value in one column is a valid value in another column. For
-example, I have seen many datasets where values such as 0, -9, 999 and
-specific dates are used as column specific unknown values.
-
-This note describes a set of functions in package \pkg{gdata}\footnote{
-  package version 2.3.1} \citep{WarnesGdata}: \code{isUnknown},
-\code{unknownToNA} and \code{NAToUnknown}, which can help with testing for
-unknown values and conversions between unknown values and \code{NA}. All
-three functions are generic (S3) and were tested (at the time of writing)
-to work with: \code{integer}, \code{numeric}, \code{character},
-\code{factor}, \code{Date}, \code{POSIXct}, \code{POSIXlt}, \code{list},
-\code{data.frame} and \code{matrix} classes.
-
-\section{Description with examples}
-
-The following examples show simple usage of these functions on
-\code{numeric} and \code{factor} classes, where value \code{0} (beside
-\code{NA}) should be treated as an unknown value:
-
-<<ex01>>=
-library("gdata")
-xNum <- c(0, 6, 0, 7, 8, 9, NA)
-isUnknown(x=xNum)
-@
-
-The default unknown value in \code{isUnknown} is \code{NA}, which means
-that output is the same as \code{is.na} --- at least for atomic
-classes. However, we can pass the argument \code{unknown} to define which
-values should be treated as unknown:
-
-<<ex02>>=
-isUnknown(x=xNum, unknown=0)
-@
-
-This skipped \code{NA}, but we can get the expected answer after
-appropriately adding \code{NA} into the argument \code{unknown}:
-
-<<ex03>>=
-isUnknown(x=xNum, unknown=c(0, NA))
-@
-
-Now, we can change all unknown values to \code{NA} with \code{unknownToNA}.
-There is clearly no need to add \code{NA} here. This step is very handy
-after importing data from an external source, where many different unknown
-values might be used. Argument \code{warning=TRUE} can be used, if there is
-a need to be warned about ``original'' \code{NA}s:
-
-<<ex04>>=
-(xNum2 <- unknownToNA(x=xNum, unknown=0))
-@
-
-Prior to export from \R{}, we might want to change unknown values
-(\code{NA} in \R{}) to some other value. Function \code{NAToUnknown} can be
-used for this:
-
-<<ex05>>=
-NAToUnknown(x=xNum2, unknown=999)
-@
-
-Converting \code{NA} to a value that already exists in \code{x} issues an
-error, but \code{force=TRUE} can be used to overcome this if needed. But be
-warned that there is no way back from this step:
-
-<<ex06>>=
-NAToUnknown(x=xNum2, unknown=7, force=TRUE)
-@
-
-Examples below show all peculiarities with class \code{factor}.
-\code{unknownToNA} removes \code{unknown} value from levels and inversely
-\code{NAToUnknown} adds it with a warning. Additionally, \code{"NA"} is
-properly distinguished from \code{NA}. It can also be seen that the
-argument \code{unknown} in functions \code{isUnknown} and
-\code{unknownToNA} need not match the class of \code{x} (otherwise factor
-should be used) as the test is internally done with \code{\%in\%}, which
-nicely resolves coercing issues.
-
-<<ex07>>=
-(xFac <- factor(c(0, "BA", "RA", "BA", NA, "NA")))
-isUnknown(x=xFac)
-isUnknown(x=xFac, unknown=0)
-isUnknown(x=xFac, unknown=c(0, NA))
-isUnknown(x=xFac, unknown=c(0, "NA"))
-isUnknown(x=xFac, unknown=c(0, "NA", NA))
-
-(xFac <- unknownToNA(x=xFac, unknown=0))
-(xFac <- NAToUnknown(x=xFac, unknown=0))
-@
-
-These two examples with classes \code{numeric} and \code{factor} are fairly
-simple and we could get the same results with one or two lines of \R{}
-code. The real benefit of the set of functions presented here is in
-\code{list} and \code{data.frame} methods, where \code{data.frame} methods
-are merely wrappers for \code{list} methods.
-
-We need additional flexibility for \code{list}/\code{data.frame} methods,
-due to possibly having multiple unknown values that can be different among
-\code{list} components or \code{data.frame} columns. For these two methods,
-the argument \code{unknown} can be either a \code{vector} or \code{list},
-both possibly named. Of course, greater flexibility (defining multiple
-unknown values per component/column) can be achieved with a \code{list}.
-
-When a \code{vector}/\code{list} object passed to the argument
-\code{unknown} is not named, the first value/component of a
-\code{vector}/\code{list} matches the first component/column of a
-\code{list}/\code{data.frame}. This can be quite error prone, especially
-with \code{vectors}. Therefore, I encourage the use of a \code{list}. In
-case \code{vector}/\code{list} passed to argument \code{unknown} is named,
-names are matched to names of \code{list} or \code{data.frame}. If lengths
-of \code{unknown} and \code{list} or \code{data.frame} do not match,
-recycling occurs.
-
-The example below illustrates the application of the described functions to
-a list which is composed of previously defined and modified numeric
-(\code{xNum}) and factor (\code{xFac}) classes. First, function
-\code{isUnknown} is used with \code{0} as an unknown value. Note that we
-get \code{FALSE} for \code{NA}s as has been the case in the first example.
-
-<<ex08>>=
-(xList <- list(a=xNum, b=xFac))
-isUnknown(x=xList, unknown=0)
-@
-
-We need to add \code{NA} as an unknown value. However, we do not get the
-expected result this way!
-
-<<ex09>>=
-isUnknown(x=xList, unknown=c(0, NA))
-@
-
-This is due to matching of values in the argument \code{unknown} and
-components in a \code{list}; i.e., \code{0} is used for component \code{a}
-and \code{NA} for component \code{b}.  Therefore, it is less error prone
-and more flexible to pass a \code{list} (preferably a named list) to the
-argument \code{unknown}, as shown below.
-
-<<ex10>>=
-(xList1 <- unknownToNA(x=xList,
-                       unknown=list(b=c(0, "NA"),
-                                    a=0)))
-@
-
-Changing \code{NA}s to some other value (only one per component/column) can
-be accomplished as follows:
-
-<<ex11>>=
-NAToUnknown(x=xList1,
-            unknown=list(b="no", a=0))
-@
-
-A named component \code{.default} of a \code{list} passed to argument
-\code{unknown} has a special meaning as it will match a component/column
-with that name and any other not defined in \code{unknown}. As such it is
-very useful if the number of components/columns with the same unknown
-value(s) is large. Consider a wide \code{data.frame} named \code{df}. Now
-\code{.default} can be used to define unknown value for several columns:
-
-<<ex12, echo=FALSE>>=
-df <- data.frame(col1=c(0, 1, 999, 2),
-                 col2=c("a", "b", "c", "unknown"),
-                 col3=c(0, 1, 2, 3),
-                 col4=c(0, 1, 2, 2))
-@
-
-<<ex13>>=
-tmp <- list(.default=0,
-            col1=999,
-            col2="unknown")
-(df2 <- unknownToNA(x=df,
-                    unknown=tmp))
-@
-
-If there is a need to work only on some components/columns you can of
-course ``skip'' columns with standard \R{} mechanisms, i.e.,
-by subsetting \code{list} or \code{data.frame} objects:
-
-<<ex14>>=
-df2 <- df
-cols <- c("col1", "col2")
-tmp <- list(col1=999,
-            col2="unknown")
-df2[, cols] <- unknownToNA(x=df[, cols],
-                           unknown=tmp)
-df2
-@
-
-\section{Summary}
-
-Functions \code{isUnknown}, \code{unknownToNA} and \code{NAToUnknown}
-provide a useful interface to work with various representations of
-unknown/missing values. Their use is meant primarily for shaping the data
-after importing to or before exporting from \R{}. I welcome any comments or
-suggestions.
-
-% \bibliography{refs}
-
-\begin{thebibliography}{1}
-\providecommand{\natexlab}[1]{#1}
-\providecommand{\url}[1]{\texttt{#1}}
-\expandafter\ifx\csname urlstyle\endcsname\relax
-  \providecommand{\doi}[1]{doi: #1}\else
-  \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi
-
-\bibitem[Gorjanc(2007)]{Gorjanc}
-G.~Gorjanc.
-\newblock Working with unknown values: the gdata package.
-\newblock \emph{R News}, 7\penalty0 (1):\penalty0 24--26, 2007.
-\newblock URL \url{http://CRAN.R-project.org/doc/Rnews/Rnews_2007-1.pdf}.
-
-\bibitem[{R Development Core Team}(2006)]{RImportExportManual}
-{R Development Core Team}.
-\newblock \emph{R Data Import/Export}, 2006.
-\newblock URL \url{http://cran.r-project.org/manuals.html}.
-\newblock ISBN 3-900051-10-0.
-
-\bibitem[Warnes (2006)]{WarnesGdata}
-G.~R. Warnes.
-\newblock \emph{gdata: Various R programming tools for data manipulation},
-  2006.
-\newblock URL
-  \url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}.
-\newblock R package version 2.3.1. Includes R source code and/or documentation
-  contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley.
-
-\end{thebibliography}
-
-\address{Gregor Gorjanc\\
-  University of Ljubljana, Slovenia\\
-\email{gre...@bf...}}
-
-\end{article}
-
-\end{document}
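
The column-specific handling of unknown values described in the vignette above, including the special .default entry, can be approximated in base R. The following is a rough sketch under stated assumptions rather than the gdata code: the function name unknownToNASketch is hypothetical, and it covers only data frames with a named list of unknown values.

```r
# Hypothetical helper (not the gdata implementation): set column-specific
# unknown values to NA. `unknown` is a named list; the optional `.default`
# entry applies to every column that has no entry of its own.
unknownToNASketch <- function(df, unknown) {
  for (col in names(df)) {
    codes <- if (col %in% names(unknown)) unknown[[col]] else unknown[[".default"]]
    if (!is.null(codes)) {
      # %in% handles the comparison without class-matching trouble.
      df[[col]][df[[col]] %in% codes] <- NA
    }
  }
  df
}

df <- data.frame(col1 = c(0, 1, 999, 2),
                 col2 = c("a", "b", "c", "unknown"),
                 col3 = c(0, 1, 2, 3),
                 stringsAsFactors = FALSE)

df2 <- unknownToNASketch(df, list(.default = 0, col1 = 999, col2 = "unknown"))
```

Here the 0 in col1 stays a valid value because col1 has its own entry (999), while col3 falls back to .default.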

Copied: trunk/gdata/vignettes/mapLevels.Rnw (from rev 1786, trunk/gdata/inst/doc/mapLevels.Rnw)
===================================================================
--- trunk/gdata/vignettes/mapLevels.Rnw	                        (rev 0)
+++ trunk/gdata/vignettes/mapLevels.Rnw	2014-04-05 14:26:49 UTC (rev 1789)
@@ -0,0 +1,229 @@
+
+%\VignetteIndexEntry{Mapping levels of a factor}
+%\VignettePackage{gdata}
+%\VignetteKeywords{levels, factor, manip}
+
+\documentclass[a4paper]{report}
+\usepackage{Rnews}
+\usepackage[round]{natbib}
+\bibliographystyle{abbrvnat}
+
+\usepackage{Sweave}
+\SweaveOpts{strip.white=all, keep.source=TRUE}
+
+\begin{document}
+
+\begin{article}
+
+\title{Mapping levels of a factor}
+\subtitle{The \pkg{gdata} package}
+\author{by Gregor Gorjanc}
+
+\maketitle
+
+\section{Introduction}
+
+Factors use the levels attribute to store the mapping between internal
+integer codes and character values, i.e. levels. The first level is
+mapped to internal integer code 1 and so on. Although some users do not
+like factors, they are more efficient in terms of storage than
+character vectors. Additionally, there are many functions in base \R{} that
+provide additional value for factors. Sometimes users need to work with
+internal integer codes and map them back to a factor, especially when
+interfacing with external programs. Mapping information is also of interest if
+there are many factors that should have the same set of levels. This note
+describes the \code{mapLevels} function, which is a utility function for
+mapping the levels of a factor in the \pkg{gdata}\footnote{from version 2.3.1}
+package \citep{WarnesGdata}.
+
+\section{Description with examples}
+
+Function \code{mapLevels()} is an (S3) generic function and works on
+\code{factor} and \code{character} atomic classes. It also works on
+\code{list} and \code{data.frame} objects with previously mentioned atomic
+classes. Function \code{mapLevels} produces a so called ``map'' with names
+and values. Names are levels, while values can be internal integer codes or
+(possibly other) levels. This will be clarified later on.  Class of this
+``map'' is \code{levelsMap}, if \code{x} in \code{mapLevels()} was atomic
+or \code{listLevelsMap} otherwise - for \code{list} and \code{data.frame}
+classes. The following example shows the creation and printout of such a
+``map''.
+
+<<ex01>>=
+library(gdata)
+(fac <- factor(c("B", "A", "Z", "D")))
+(map <- mapLevels(x=fac))
+@
+
+If we have to work with internal integer codes, we can transform a factor to
+integer and still get ``back the original factor'' with the ``map'' used as
+an argument in the \code{mapLevels<-} function, as shown below. \code{mapLevels<-}
+is also an (S3) generic function and works on the same classes as
+\code{mapLevels} plus the \code{integer} atomic class.
+
+<<ex02>>=
+(int <- as.integer(fac))
+mapLevels(x=int) <- map
+int
+identical(fac, int)
+@
+
+Internally ``map'' (\code{levelsMap} class) is a \code{list} (see below),
+but its print method unlists it for ease of inspection. The ``map'' from the
+example has all components of length 1. This is not mandatory as the
+\code{mapLevels<-} function is only a wrapper around the workhorse function
+\code{levels<-} and the latter can accept a \code{list} with components of
+various lengths.
+
+<<ex03>>=
+str(map)
+@
+
+Although not of primary importance, this ``map'' can also be used to remap
+factor levels as shown below.  Components ``later'' in the map take over
+the ``previous'' ones. Since this is not optimal I would rather recommend
+other approaches for ``remapping'' the levels of a \code{factor}, say
+\code{recode} in the \pkg{car} package \citep{FoxCar}.
+
+<<ex04>>=
+map[[2]] <- as.integer(c(1, 2))
+map
+int <- as.integer(fac)
+mapLevels(x=int) <- map
+int
+@
+
+Up to now the examples showed a ``map'' with internal integer codes for values
+and levels for names. I call this an integer ``map''. On the other hand, a
+character ``map'' uses levels for values and (possibly other) levels for
+names. This feature is a bit odd at first sight, but can be used to easily
+unify levels and internal integer codes across several factors.  Imagine
+you have a factor that is for some reason split into two factors \code{f1}
+and \code{f2} and that each factor does not have all levels. This is not
+an uncommon situation.
+
+<<ex05>>=
+(f1 <- factor(c("A", "D", "C")))
+(f2 <- factor(c("B", "D", "C")))
+@
+
+If we work with these factors, we need to be careful as they do not have the
+same set of levels. This can be solved by appropriately specifying the
+\code{levels} argument when creating the factors, i.e. \code{levels=c("A", "B",
+  "C", "D")}, or with proper use of the \code{levels<-} function. I say proper
+as it is very tempting to use:
+
+<<ex06>>=
+fTest <- f1
+levels(fTest) <- c("A", "B", "C", "D")
+fTest
+@
+
+The above example extends the set of levels, but also changes the level of the
+2nd and 3rd element in \code{fTest}! Proper use of \code{levels<-} (as shown in
+the \code{levels} help page) would be:
+
+<<ex07>>=
+fTest <- f1
+levels(fTest) <- list(A="A", B="B",
+                      C="C", D="D")
+fTest
+@
+
+The function \code{mapLevels} with a character ``map'' can help us in such
+scenarios to unify levels and internal integer codes across several
+factors. Again the workhorse under this process is the \code{levels<-} function
+from base \R{}! The function \code{mapLevels<-} just controls the assignment of an
+(integer or character) ``map'' to \code{x}. Levels in \code{x} that match
+``map'' values (internal integer codes or levels) are changed to ``map''
+names (possibly other levels) as shown in the \code{levels} help page. Levels
+that do not match are converted to \code{NA}. An integer ``map'' can be
+applied to \code{integer} or \code{factor}, while a character ``map'' can be
+applied to \code{character} or \code{factor}. The result of \code{mapLevels<-}
+is always a \code{factor} with possibly ``remapped'' levels.
+
+To get one joint character ``map'' for several factors, we need to put the
+factors in a \code{list} or \code{data.frame} and use the arguments
+\code{codes=FALSE} and \code{combine=TRUE}. Such a map can then be used to
+unify levels and internal integer codes.
+
+<<ex08>>=
+(bigMap <- mapLevels(x=list(f1, f2),
+                     codes=FALSE,
+                     combine=TRUE))
+mapLevels(f1) <- bigMap
+mapLevels(f2) <- bigMap
+f1
+f2
+cbind(as.character(f1), as.integer(f1),
+      as.character(f2), as.integer(f2))
+@
+
+If we do not specify \code{combine=TRUE} (leaving it out is the default behaviour)
+and \code{x} is a \code{list} or \code{data.frame}, \code{mapLevels}
+returns a ``map'' of class \code{listLevelsMap}. This is internally a
+\code{list} of ``maps'' (\code{levelsMap} objects). Both
+\code{listLevelsMap} and \code{levelsMap} objects can be passed to
+\code{mapLevels<-} for \code{list}/\code{data.frame}. Recycling occurs when the
+length of the \code{listLevelsMap} is not the same as the number of
+components/columns of the \code{list}/\code{data.frame}.
+
+Additional convenience methods are also implemented to ease the work with
+``maps'':
+
+\begin{itemize}
+
+\item \code{is.levelsMap}, \code{is.listLevelsMap}, \code{as.levelsMap} and
+  \code{as.listLevelsMap} for testing and coercion of user defined
+  ``maps'',
+
+\item \code{"["} for subsetting,
+
+\item \code{c} for combining \code{levelsMap} or \code{listLevelsMap}
+  objects; argument \code{recursive=TRUE} can be used to coerce
+  \code{listLevelsMap} to \code{levelsMap}, for example \code{c(llm1, llm2,
+    recursive=TRUE)} and
+
+\item \code{unique} and \code{sort} for \code{levelsMap}.
+
+\end{itemize}
+
+\section{Summary}
+
+Functions \code{mapLevels} and \code{mapLevels<-} can help users to map
+internal integer codes to factor levels and unify levels as well as
+internal integer codes among several factors. I welcome any comments or
+suggestions.
+
+% \bibliography{refs}
+\begin{thebibliography}{1}
+\providecommand{\natexlab}[1]{#1}
+\providecommand{\url}[1]{\texttt{#1}}
+\expandafter\ifx\csname urlstyle\endcsname\relax
+  \providecommand{\doi}[1]{doi: #1}\else
+  \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi
+
+\bibitem[Fox(2006)]{FoxCar}
+J.~Fox.
+\newblock \emph{car: Companion to Applied Regression}, 2006.
+\newblock URL \url{http://socserv.socsci.mcmaster.ca/jfox/}.
+\newblock R package version 1.1-1.
+
+\bibitem[Warnes(2006)]{WarnesGdata}
+G.~R. Warnes.
+\newblock \emph{gdata: Various R programming tools for data manipulation},
+  2006.
+\newblock URL
+  \url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}.
+\newblock R package version 2.3.1. Includes R source code and/or documentation
+  contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley.
+
+\end{thebibliography}
+
+\address{Gregor Gorjanc\\
+  University of Ljubljana, Slovenia\\
+\email{gre...@bf...}}
+
+\end{article}
+
+\end{document}
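
The round trip between a factor and its internal integer codes, which the copied vignette above performs with mapLevels and mapLevels<-, relies on levels<- accepting a named list. A minimal base R sketch of that mechanism follows; it assumes a hand-built map (names are levels, values are integer codes) and is not the gdata implementation.

```r
fac <- factor(c("B", "A", "Z", "D"))

# An integer "map" in the spirit of the vignette: names are levels,
# values are the internal integer codes (built here by hand).
map <- as.list(seq_along(levels(fac)))
names(map) <- levels(fac)

# Work with the bare integer codes, e.g. for an external program.
int <- as.integer(fac)

# Restore the factor: factor() turns the codes into levels "1".."4",
# and levels<- with a named list relabels each code with its name.
restored <- factor(int, levels = unlist(map))
levels(restored) <- map
```

Passing the codes as explicit levels keeps the relabelling correct even when some codes do not occur in the integer vector.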

Copied: trunk/gdata/vignettes/unknown.Rnw (from rev 1786, trunk/gdata/inst/doc/unknown.Rnw)
===================================================================
--- trunk/gdata/vignettes/unknown.Rnw	                        (rev 0)
+++ trunk/gdata/vignettes/unknown.Rnw	2014-04-05 14:26:49 UTC (rev 1789)
@@ -0,0 +1,272 @@
+
+%\VignetteIndexEntry{Working with Unknown Values}
+%\VignettePackage{gdata}
+%\VignetteKeywords{unknown, missing, manip}
+
+\documentclass[a4paper]{report}
+\usepackage{Rnews}
+\usepackage[round]{natbib}
+\bibliographystyle{abbrvnat}
+
+\usepackage{Sweave}
+\SweaveOpts{strip.white=all, keep.source=TRUE}
+
+\begin{document}
+
+\begin{article}
+
+\title{Working with Unknown Values}
+\subtitle{The \pkg{gdata} package}
+\author{by Gregor Gorjanc}
+
+\maketitle
+
+This vignette has been published as \cite{Gorjanc}.
+
+\section{Introduction}
+
+Unknown or missing values can be represented in various ways. For example
+SAS uses \code{.}~(dot), while \R{} uses \code{NA}, which we can read as
+Not Available. When we import data into \R{}, say via \code{read.table} or
+its derivatives, conversion of blank fields to \code{NA} (according to
+\code{read.table} help) is done for \code{logical}, \code{integer},
+\code{numeric} and \code{complex} classes. Additionally, the
+\code{na.strings} argument can be used to specify values that should also
+be converted to \code{NA}. Inversely, there is an argument \code{na} in
+\code{write.table} and its derivatives to define the value that will replace
+\code{NA} in exported data. There are also other ways to import/export data
+into \R{} as described in the \emph{R Data Import/Export} manual
+\citep{RImportExportManual}.  However, all approaches lack the possibility
+to define unknown value(s) for some particular column. It is possible that
+an unknown value in one column is a valid value in another column. For
+example, I have seen many datasets where values such as 0, -9, 999 and
+specific dates are used as column specific unknown values.
+
+This note describes a set of functions in package \pkg{gdata}\footnote{
+  package version 2.3.1} \citep{WarnesGdata}: \code{isUnknown},
+\code{unknownToNA} and \code{NAToUnknown}, which can help with testing for
+unknown values and conversions between unknown values and \code{NA}. All
+three functions are generic (S3) and were tested (at the time of writing)
+to work with: \code{integer}, \code{numeric}, \code{character},
+\code{factor}, \code{Date}, \code{POSIXct}, \code{POSIXlt}, \code{list},
+\code{data.frame} and \code{matrix} classes.
+
+\section{Description with examples}
+
+The following examples show simple usage of these functions on
+\code{numeric} and \code{factor} classes, where value \code{0} (beside
+\code{NA}) should be treated as an unknown value:
+
+<<ex01>>=
+library("gdata")
+xNum <- c(0, 6, 0, 7, 8, 9, NA)
+isUnknown(x=xNum)
+@
+
+The default unknown value in \code{isUnknown} is \code{NA}, which means
+that output is the same as \code{is.na} --- at least for atomic
+classes. However, we can pass the argument \code{unknown} to define which
+values should be treated as unknown:
+
+<<ex02>>=
+isUnknown(x=xNum, unknown=0)
+@
+
+This skipped \code{NA}, but we can get the expected answer after
+appropriately adding \code{NA} into the argument \code{unknown}:
+
+<<ex03>>=
+isUnknown(x=xNum, unknown=c(0, NA))
+@
+
+Now we can change all unknown values to \code{NA} with \code{unknownToNA}.
+There is clearly no need to add \code{NA} here. This step is very handy
+after importing data from an external source, where many different unknown
+values might be used. The argument \code{warning=TRUE} can be used if
+there is a need to be warned about ``original'' \code{NA}s:
+
+<<ex04>>=
+(xNum2 <- unknownToNA(x=xNum, unknown=0))
+@
+
+Prior to export from \R{}, we might want to change unknown values
+(\code{NA} in \R{}) to some other value. Function \code{NAToUnknown} can be
+used for this:
+
+<<ex05>>=
+NAToUnknown(x=xNum2, unknown=999)
+@
+
+Converting \code{NA} to a value that already exists in \code{x} issues an
+error, but \code{force=TRUE} can be used to overcome this if needed. Be
+warned, however, that there is no way back from this step, because the
+converted values can no longer be told apart from the original ones:
+
+<<ex06>>=
+NAToUnknown(x=xNum2, unknown=7, force=TRUE)
+@
+
+The examples below show the peculiarities of class \code{factor}.
+\code{unknownToNA} removes the \code{unknown} value from the levels and,
+conversely, \code{NAToUnknown} adds it with a warning. Additionally, the
+string \code{"NA"} is properly distinguished from \code{NA}. It can also be
+seen that the argument \code{unknown} in functions \code{isUnknown} and
+\code{unknownToNA} need not match the class of \code{x} (otherwise a
+factor would have to be passed), as the test is done internally with
+\code{\%in\%}, which nicely resolves coercion issues.
+
+<<ex07>>=
+(xFac <- factor(c(0, "BA", "RA", "BA", NA, "NA")))
+isUnknown(x=xFac)
+isUnknown(x=xFac, unknown=0)
+isUnknown(x=xFac, unknown=c(0, NA))
+isUnknown(x=xFac, unknown=c(0, "NA"))
+isUnknown(x=xFac, unknown=c(0, "NA", NA))
+
+(xFac <- unknownToNA(x=xFac, unknown=0))
+(xFac <- NAToUnknown(x=xFac, unknown=0))
+@
+
+These two examples with classes \code{numeric} and \code{factor} are fairly
+simple and we could get the same results with one or two lines of \R{}
+code. The real benefit of the set of functions presented here is in
+\code{list} and \code{data.frame} methods, where \code{data.frame} methods
+are merely wrappers for \code{list} methods.
+
+We need additional flexibility for \code{list}/\code{data.frame} methods,
+due to possibly having multiple unknown values that can be different among
+\code{list} components or \code{data.frame} columns. For these two methods,
+the argument \code{unknown} can be either a \code{vector} or \code{list},
+both possibly named. Of course, greater flexibility (defining multiple
+unknown values per component/column) can be achieved with a \code{list}.
+
+When the \code{vector}/\code{list} object passed to the argument
+\code{unknown} is not named, its first value/component is matched to the
+first component/column of the \code{list}/\code{data.frame}, and so on.
+This can be quite error prone, especially with \code{vector}s. Therefore,
+I encourage the use of a \code{list}. If the \code{vector}/\code{list}
+passed to the argument \code{unknown} is named, its names are matched to
+the names of the \code{list} or \code{data.frame}. If the lengths of
+\code{unknown} and the \code{list} or \code{data.frame} do not match,
+recycling occurs.
+
+The example below illustrates the application of the described functions to
+a list composed of the previously defined and modified numeric
+(\code{xNum}) and factor (\code{xFac}) objects. First, function
+\code{isUnknown} is used with \code{0} as an unknown value. Note that we
+get \code{FALSE} for \code{NA}s, as was the case when \code{unknown=0} was
+used with \code{xNum} alone.
+
+<<ex08>>=
+(xList <- list(a=xNum, b=xFac))
+isUnknown(x=xList, unknown=0)
+@
+
+We need to add \code{NA} as an unknown value. However, we do not get the
+expected result this way!
+
+<<ex09>>=
+isUnknown(x=xList, unknown=c(0, NA))
+@
+
+This is due to matching of values in the argument \code{unknown} and
+components in a \code{list}; i.e., \code{0} is used for component \code{a}
+and \code{NA} for component \code{b}.  Therefore, it is less error prone
+and more flexible to pass a \code{list} (preferably a named list) to the
+argument \code{unknown}, as shown below.
+
+<<ex10>>=
+(xList1 <- unknownToNA(x=xList,
+                       unknown=list(b=c(0, "NA"),
+                                    a=0)))
+@
+
+Changing \code{NA}s to some other value (only one per component/column) can
+be accomplished as follows:
+
+<<ex11>>=
+NAToUnknown(x=xList1,
+            unknown=list(b="no", a=0))
+@
+
+A named component \code{.default} of a \code{list} passed to the argument
+\code{unknown} has a special meaning: it matches a component/column with
+that name as well as any other component/column not defined in
+\code{unknown}. As such, it is very useful if the number of
+components/columns with the same unknown value(s) is large. Consider a wide
+\code{data.frame} named \code{df}. Now \code{.default} can be used to
+define the unknown value for all columns that are not listed explicitly:
+
+<<ex12, echo=FALSE>>=
+df <- data.frame(col1=c(0, 1, 999, 2),
+                 col2=c("a", "b", "c", "unknown"),
+                 col3=c(0, 1, 2, 3),
+                 col4=c(0, 1, 2, 2))
+@
+
+<<ex13>>=
+tmp <- list(.default=0,
+            col1=999,
+            col2="unknown")
+(df2 <- unknownToNA(x=df,
+                    unknown=tmp))
+@
+
+If there is a need to work only on some components/columns you can of
+course ``skip'' columns with standard \R{} mechanisms, i.e.,
+by subsetting \code{list} or \code{data.frame} objects:
+
+<<ex14>>=
+df2 <- df
+cols <- c("col1", "col2")
+tmp <- list(col1=999,
+            col2="unknown")
+df2[, cols] <- unknownToNA(x=df[, cols],
+                           unknown=tmp)
+df2
+@
+
+\section{Summary}
+
+Functions \code{isUnknown}, \code{unknownToNA} and \code{NAToUnknown}
+provide a useful interface to work with various representations of
+unknown/missing values. Their use is meant primarily for shaping the data
+after importing to or before exporting from \R{}. I welcome any comments or
+suggestions.
+
+% \bibliography{refs}
+
+\begin{thebibliography}{1}
+\providecommand{\natexlab}[1]{#1}
+\providecommand{\url}[1]{\texttt{#1}}
+\expandafter\ifx\csname urlstyle\endcsname\relax
+  \providecommand{\doi}[1]{doi: #1}\else
+  \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi
+
+\bibitem[Gorjanc(2007)]{Gorjanc}
+G.~Gorjanc.
+\newblock Working with unknown values: the gdata package.
+\newblock \emph{R News}, 7\penalty0 (1):\penalty0 24--26, 2007.
+\newblock URL \url{http://CRAN.R-project.org/doc/Rnews/Rnews_2007-1.pdf}.
+
+\bibitem[{R Development Core Team}(2006)]{RImportExportManual}
+{R Development Core Team}.
+\newblock \emph{R Data Import/Export}, 2006.
+\newblock URL \url{http://cran.r-project.org/manuals.html}.
+\newblock ISBN 3-900051-10-0.
+
+\bibitem[Warnes(2006)]{WarnesGdata}
+G.~R. Warnes.
+\newblock \emph{gdata: Various R programming tools for data manipulation},
+  2006.
+\newblock URL
+  \url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}.
+\newblock R package version 2.3.1. Includes R source code and/or documentation
+  contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley.
+
+\end{thebibliography}
+
+\address{Gregor Gorjanc\\
+  University of Ljubljana, Slovenia\\
+\email{gre...@bf...}}
+
+\end{article}
+
+\end{document}
