Writing with data

You will typically (if not always) need to summarize your work in writing. This page describes how to do so using R Markdown.

This is the third module in the Best Practices topic; the relevant Slack channel is here.

Some slides

Writing with data from Jeff Goldsmith.

Example

Before jumping in, one short note about the default RStudio treatment of R Markdown documents: it behaves like a “notebook” and shows output mixed in with the code, rather than in the console or viewer. I don’t like this (and I don’t think I’m the only one), but I might just be old and set in my ways. This behavior can be turned off using Preferences > R Markdown > Show output inline.

Basic RMD

Below is a first RMD file. To follow along, create a .Rmd file using File > New File > R Markdown; replace the default text with what’s below, and save the file somewhere that makes sense (e.g. in a new directory).

---
title: "Simple document"
output: html_document
---

I'm an R Markdown document! 

# Section 1

Here's a **code chunk** that samples from 
a _normal distribution_:

```{r}
samp = rnorm(100)
length(samp)
```

# Section 2

I can take the mean of the sample, too!
The mean is `r mean(samp)`.

There are three major components to this file:

YAML header: The segment at the beginning of the document bracketed by ---s.
Text + inline R: Written text with simple formatting like # heading, **bold**, and _italic_
Code chunks: Blocks of code surrounded by ```

The combination of these elements allows you a great deal of flexibility and power as an author.

R Markdown documents are rendered to produce complete reports with text, formatting, and code results by “knitting”. You can knit your document using the RStudio GUI or Cmd/Ctrl + Shift + K; these options call rmarkdown::render() (which you can run directly from the R console if you prefer). Behind the scenes, knitr executes the code and creates a Markdown document, and pandoc translates that to the output format you specify (e.g. HTML, PDF, Word).
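
If you'd rather knit from the R console than the RStudio GUI, one way is rmarkdown::render(). The sketch below assumes the rmarkdown package and pandoc are installed (RStudio bundles pandoc); the file name simple_document.Rmd and the use of a temporary directory are illustrative choices, not part of the course materials.

```r
# Build the three-backtick fence as a string so it can be written to a file.
fence = strrep("`", 3)

# Write a small R Markdown file to a temporary directory (hypothetical file name).
rmd_path = file.path(tempdir(), "simple_document.Rmd")
writeLines(c(
  "---",
  "title: \"Simple document\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r}"),
  "samp = rnorm(100)",
  "length(samp)",
  fence,
  "",
  "The mean is `r mean(samp)`."
), rmd_path)

# Knit from the console; render() runs knitr and then pandoc.
# The guard skips rendering on machines where pandoc can't be found.
if (rmarkdown::pandoc_available()) {
  out_file = rmarkdown::render(rmd_path, quiet = TRUE)
  print(out_file)  # path to the rendered .html file
}
```

By default the rendered file is written next to the .Rmd file, which is why the knitted result shows up in the same directory as your document.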

Learning assessment: Take two minutes and create an R Markdown document as above. Make sure you can knit it and find the result in your local directory.

Code chunks and snippets

We’ll start with code chunks, since these are the distinguishing feature of R Markdown documents. The code chunks take the place of scripts in that they hold the code you use to produce your results. However, they tend to be briefer and more self-contained: you’re nestling these bits of code among the text that supports them and the results they produce. You can still execute code in chunks using Cmd/Ctrl + Enter, and you will still develop code by writing and refining until you have something you’re happy with.

Although the benefits will mostly become apparent later, I recommend you get in the habit of naming your code chunks now using {r chunk_name}.

Beyond the name, you can customize the behavior of your code chunk via options defined in the chunk header. Some common options are:

eval = FALSE: code will be displayed but not executed; results are not included.
echo = FALSE: code will be executed but not displayed; results are included.
include = FALSE: code will be executed but not displayed; results are not included.
message = FALSE and warning = FALSE: prevents messages and warnings from being displayed.
results = "hide" and fig.show = "hide": prevents results and figures from being shown, respectively.
collapse = TRUE: the code and text output of a chunk will be collapsed into a single block rather than split into separate blocks.
error: errors in code will stop rendering when FALSE; errors in code will be printed in the doc when TRUE. The default is FALSE and you should almost never change it.

Use these options to be judicious about what you include in your report. Remember to keep your audience in mind: how much do they want or need to see?
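
Options you want for every chunk can be set once rather than repeated in each chunk header. Here is a minimal sketch, assuming the knitr package is installed. The particular defaults are only an illustration, and a call like this would normally sit in a chunk near the top of the document.

```r
# Set defaults for every chunk that follows; individual chunk headers
# can still override these (e.g. {r chunk_name, echo = FALSE}).
knitr::opts_chunk$set(
  echo = TRUE,       # show code
  message = FALSE,   # hide messages
  warning = FALSE    # hide warnings
)

# Check what is currently set for one option.
knitr::opts_chunk$get("message")
```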

You can also cache the results of a code chunk, but we will largely avoid this. Caching can save time by saving the results of a code chunk instead of re-executing when the document is knit. However, you have to be careful when using this option since downstream code can depend on upstream changes. Controlling this behavior through the dependson option can help, but if you cache code you’ll want to periodically clear your cache to ensure you’re getting reproducible results.

Inserting brief code snippets inline is often helpful; I frequently use these to give the sample size or summary statistics in text. You can insert code inline using `r `, often in conjunction with the format() function to clean up your output.
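
As a sketch of how this tends to look in practice, it can help to compute and format the values first and keep the inline snippets themselves short. The variable names below are made up for illustration.

```r
set.seed(1)
samp = rnorm(100)

# Values to report in the text.
n_obs = length(samp)

# round() trims the value; format(nsmall = 2) keeps trailing zeros,
# so a value like 0.1 would print as "0.10".
mean_txt = format(round(mean(samp), 2), nsmall = 2)

# In the document text you would then write something like:
#   The sample has `r n_obs` observations with mean `r mean_txt`.
```

Formatting up front also means the same rounded value can be reused in several places without repeating the call.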

Learning assessment: Write a named code chunk that creates a variable containing a sample of size 500 from a random normal variable, takes the absolute value, and produces a histogram of the resulting variable. Add an inline summary giving the median value rounded to two decimal places. What happens if you set eval = FALSE in the code chunk header? What about echo = FALSE?

Formatting text

There are a huge number of ways to format your documents. The overview below is essentially copied from R for Data Science; a link to a handy cheatsheet is below.

Text formatting 
------------------------------------------------------------

*italic*  or _italic_
**bold**   __bold__
`code`
superscript^2^ and subscript~2~


Headings
------------------------------------------------------------

# 1st Level Header

## 2nd Level Header

### 3rd Level Header


Lists
------------------------------------------------------------

*   Bulleted list item 1

*   Item 2

    * Item 2a

    * Item 2b

1.  Numbered list item 1

1.  Item 2. The numbers are incremented automatically in the output.


Tables 
------------------------------------------------------------

First Header  | Second Header
------------- | -------------
Content Cell  | Content Cell
Content Cell  | Content Cell

You’ll need to refer to this list (or to similar resources) pretty often at first, but most of it will become second-nature after you’ve written a few documents.
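
Tables like the one above don't have to be typed by hand. As a sketch, assuming the knitr package is installed, kable() turns a data frame into a Markdown table; the small data frame here is made up for illustration.

```r
# A made-up data frame, for illustration only.
example_df = data.frame(
  group = c("a", "b", "c"),
  n = c(10, 20, 30)
)

# format = "pipe" requests the same pipe-table syntax shown above;
# inside a code chunk, knitr picks a format suited to the output document.
knitr::kable(example_df, format = "pipe")
```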

Learning assessment: After the previous code chunk, write a bullet list giving the mean, median, and standard deviation of your random sample.

YAML and output formats

The YAML header controls global features of the document. I generally will include both the author and date in each document I produce.

author: "Jeff Goldsmith"
date: 2017-08-08

We’re mostly concerned with the output format, which is controlled through the output: field.

The snippet below will produce an HTML document. Notice that this has subfields to add a table of contents, and float that table alongside the content. These lines are used throughout the course website.

output:
  html_document:
    toc: true
    toc_float: true

HTML documents are great because they allow interactivity in a way that static formats (PDF, Word) do not. For example, adding the subfield code_folding: hide under html_document will hide all the code in the document until the reader clicks to show it.
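
The output: subfields correspond to arguments of the rmarkdown::html_document() function, so the same settings can also be written in R. This is a sketch, assuming the rmarkdown package is installed; report.Rmd is a hypothetical file name, so the render() call is left commented out.

```r
# Same settings as the YAML above, plus code folding, as an output-format object.
html_format = rmarkdown::html_document(
  toc = TRUE,
  toc_float = TRUE,
  code_folding = "hide"
)

# Passing the object to render() uses it in place of the document's
# YAML output field (hypothetical file name):
# rmarkdown::render("report.Rmd", output_format = html_format)
```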

That said, some collaborators will need or prefer static documents. You can create these using the YAML snippets below, which will produce a PDF or a Word document, respectively, when knitted.

output:
  pdf_document: default

output:
  word_document: default

The formatting for both PDF and Word documents can be controlled through options as well, although getting these right can be tricky (especially for Word documents).

Later in the course we’ll talk about some other output formats, especially dashboards and websites, and will introduce other YAML options as needed.

New workflow

All of this suggests a slight modification to the previous workflow:

Create a directory with a reasonable name and path (e.g. ~/Documents/School/DSI/Homework_2/)
Put an R Project in the directory
Keep everything related to the analysis – data inputs, scripts, reports, output – in there, and use R Markdown as much as possible
Periodically check for reproducibility of the analysis

This should become your default behavior when starting any new project.

Learning assessment: Convert the scripts for exploring vector classes and producing basic plots (from best practices) to self-contained R Markdown files that produce PDF documents.

Other materials

Constructing useful documents that combine text and code is the subject of several online guides. See below for a sampling:

The R Markdown cheatsheet is a useful resource once you have the basics down
R for Data Science devotes chapters to R Markdown, additional output formats, and a useful workflow.
The Intro to R Markdown from RStudio overlaps a lot with the previous bullet, but is also handy to review
This chapter from R Basics is a good intro to R Markdown
If webinars are more to your liking, this intro and this follow-up are good

You’ll also find a lot of stuff online that has been written using R Markdown and, with a little digging, can often find the .Rmd file as well. This is a great way to spot new tools and figure out how to incorporate them in your own documents!

The code that I produced while working through examples in lecture is here.