Tree [a977f3] master /
| File | Date | Author | Commit |
|------|------|--------|--------|
| R | 2018-06-08 | boB Rudis | [a977f3] better HTML file regex |
| inst | 2018-04-12 | boB Rudis | [ddb0a6] spiffied it up a bit |
| man | 2018-04-12 | boB Rudis | [da73ca] Experimental encoding support to help #4 along |
| tests | 2018-04-12 | boB Rudis | [ddb0a6] spiffied it up a bit |
| .Rbuildignore | 2018-04-12 | boB Rudis | [ddb0a6] spiffied it up a bit |
| .codecov.yml | 2018-04-12 | boB Rudis | [6ef504] initial commit |
| .gitattributes | 2018-04-12 | boB Rudis | [2cfb61] Update travis & appveyor configs |
| .gitignore | 2018-04-12 | boB Rudis | [6ef504] initial commit |
| .travis.yml | 2018-04-12 | boB Rudis | [2cfb61] Update travis & appveyor configs |
| CONDUCT.md | 2018-04-12 | boB Rudis | [ddb0a6] spiffied it up a bit |
| DESCRIPTION | 2018-04-12 | boB Rudis | [ddb0a6] spiffied it up a bit |
| NAMESPACE | 2018-04-12 | boB Rudis | [ddb0a6] spiffied it up a bit |
| NEWS.md | 2018-04-12 | boB Rudis | [6ef504] initial commit |
| README.Rmd | 2018-04-12 | boB Rudis | [ddb0a6] spiffied it up a bit |
| README.md | 2018-04-12 | boB Rudis | [b915b6] rookie mistake leaving hack-test code in a func... |
| appveyor.yml | 2018-04-12 | boB Rudis | [2cfb61] Update travis & appveyor configs |
| codecov.yml | 2018-04-12 | boB Rudis | [ddb0a6] spiffied it up a bit |
| pubcrawl.Rproj | 2018-04-12 | boB Rudis | [6ef504] initial commit |

Read Me

---
output: rmarkdown::github_document
---

[![Travis-CI Build Status](https://travis-ci.org/hrbrmstr/pubcrawl.svg?branch=master)](https://travis-ci.org/hrbrmstr/pubcrawl) 
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/hrbrmstr/pubcrawl?branch=master&svg=true)](https://ci.appveyor.com/project/hrbrmstr/pubcrawl) 
[![Coverage Status](https://img.shields.io/codecov/c/github/hrbrmstr/pubcrawl/master.svg)](https://codecov.io/github/hrbrmstr/pubcrawl?branch=master)

# pubcrawl

Convert 'epub' Files to Text

## Description

The 'epub' file format is really just a structured 'ZIP' archive with 
metadata, graphics and (usually) 'HTML' text. Tools are provided to turn an 'epub'
file into a tidy data frame.
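
Because an 'epub' is a 'ZIP' archive, its contents can be inspected with base R alone. The sketch below is only an illustration of that idea, not how `epub_to_text` works internally: `is_html_entry` and `list_epub_html` are hypothetical helper names, and the file-extension pattern is an assumption about what counts as 'HTML' text.

```{r}
# Hypothetical helper: flag archive entries whose names look like (X)HTML files.
# The extension pattern is an assumption, not the package's own regex.
is_html_entry <- function(entry_names) {
  grepl("\\.x?html?$", entry_names, ignore.case = TRUE)
}

# Hypothetical helper: list the (X)HTML entries of an epub without extracting it.
list_epub_html <- function(epub_path) {
  # With list = TRUE, unzip() returns a data frame with columns Name, Length and Date.
  entries <- utils::unzip(epub_path, list = TRUE)
  entries[is_html_entry(entries$Name), , drop = FALSE]
}

# The name filter can be tried on its own, without an archive on disk.
is_html_entry(c("OEBPS/ch01.xhtml", "OEBPS/cover.jpg", "toc.ncx", "index.HTM", "mimetype"))
```

Given a real file, `list_epub_html("path/to/book.epub")` would return one row per matching archive entry.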

## What's Inside The Tin

The following functions are implemented:

- `epub_to_text`:	Convert an epub file into a data frame of plaintext chapters

## NOTE

There are edge cases I've totally not covered yet. Feel free to jump in and make this a real, useful package!

## TODO

- [X] Refactor so there aren't so many heavy dependencies
- <strike> [ ] Try to get `hgr` on CRAN so it's not a GH dep</strike> Moved the cleaner code into here
- [ ] Better docs
- [X] Embed some epubs for examples and tests
- [X] Setup Travis, Appveyor, code coverage

## Installation

```{r eval=FALSE}
devtools::install_github("hrbrmstr/pubcrawl")
```

```{r message=FALSE, warning=FALSE, error=FALSE, include=FALSE}
options(width=120)
```

## Usage

```{r message=FALSE, warning=FALSE, error=FALSE}
library(pubcrawl)
library(tidyverse)

# current version
packageVersion("pubcrawl")

```

### An O'Reilly epub

```{r}
epub_to_text("~/Data/R Packages.epub")
```

### A Project Gutenberg epub that comes with the package

```{r}
epub_to_text(system.file("extdat", "augustine.epub", package="pubcrawl")) %>% 
  mutate(path = abbreviate(path))
```
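
### Working with the result

Once the chapters are in a data frame, ordinary data frame tools apply. The sketch below uses a small mock data frame so it runs without an epub. Only the `path` column appears in the examples above; the `content` column name is an assumption made for illustration and may not match what `epub_to_text` actually returns.

```{r}
# Mock of the assumed result shape; `content` is a hypothetical column name.
chapters <- data.frame(
  path = c("OEBPS/ch01.xhtml", "OEBPS/ch02.xhtml"),
  content = c("It was a dark and stormy night.", "The rain fell in torrents."),
  stringsAsFactors = FALSE
)

# Rough word count per chapter: split on runs of whitespace and count the pieces.
chapters$n_words <- lengths(strsplit(trimws(chapters$content), "\\s+"))

chapters[, c("path", "n_words")]
```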

## Code of Conduct

Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.