Before starting more complex analyses and projects, you should establish some good habits.
This is the second module in the Best Practices topic; the relevant Slack channel is here.
Last class we introduced several vectors. Looking at these with a critical eye, they’re not particularly well-named – it’s impossible to know the relationship between vec1 and vec4 without looking at them, and already I’ve forgotten which vector did what.
vec1 = 5:8
vec2 = c("My", "name", "is", "Jeff")
vec3 = c(TRUE, TRUE, TRUE, FALSE)
vec4 = factor(c("male", "male", "female", "female"))
Let’s rename, and tidy up a bit while we’re at it – you really don’t want to have a cluttered workspace, because you will forget what things are. This is also a good time to point out RStudio’s tabbed autocompletion – start typing a variable name and press Tab.
vec_numeric = 5:8
vec_char = c("My", "name", "is", "Jeff")
vec_logical = c(TRUE, TRUE, TRUE, FALSE)
vec_factor = factor(c("male", "male", "female", "female"))
ls()
## [1] "l" "mat1" "mat2" "mat3" "vec"
## [6] "vec_char" "vec_factor" "vec_logical" "vec_numeric" "vec1"
## [11] "vec2" "vec3" "vec4" "x" "y"
rm(list = c("vec1", "vec2", "vec3", "vec4"))
ls()
## [1] "l" "mat1" "mat2" "mat3" "vec"
## [6] "vec_char" "vec_factor" "vec_logical" "vec_numeric" "x"
## [11] "y"
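When there are many poorly named objects to clear out, typing each name into rm() gets tedious. As a minimal sketch (the regular expression is an assumption chosen to match the vec1 to vec4 names above, not something from lecture), ls() accepts a pattern argument whose result can be passed to rm():

```r
## recreate a few objects, named as in the example above
vec1 = 5:8
vec2 = c("My", "name", "is", "Jeff")
vec_numeric = 5:8

## list only objects named "vec" followed by digits
old_names = ls(pattern = "^vec[0-9]+$")
old_names

## remove those objects, then check what is left
rm(list = old_names)
exists("vec1")
exists("vec_numeric")
```

Printing old_names before calling rm() is a cheap way to make sure the pattern does not catch something you want to keep.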
You should always keep the complete code that was used for an analysis or project, no matter how brief. Your scripts should be self-contained – they should include everything you did to produce what you wanted to produce. The two scripts below repeat content from a previous example.
##################################################################
# June 2, 2017
# Jeff Goldsmith
#
# Script exploring vector classes
##################################################################
## create vectors
vec_numeric = 5:8
vec_char = c("My", "name", "is", "Jeff")
vec_logical = c(TRUE, TRUE, TRUE, FALSE)
vec_factor = factor(c("male", "male", "female", "female"))
## check class of vectors
class(vec_numeric)
class(vec_char)
class(vec_logical)
class(vec_factor)
## convert factor to numeric
as.numeric(vec_factor)
Note the use of the snake_case naming convention. Whatever naming convention you like best, get in the habit of using it consistently.
##################################################################
# June 2, 2017
# Jeff Goldsmith
#
# Script producing basic plots
##################################################################
## set seed to ensure reproducibility
set.seed(1234)
## define x and y
x = rnorm(1000)
y = 1 + 2 * x + rnorm(1000, 0, .4)
## histogram of x
hist(x)
## scatterplot of y against x
plot(x, y)
The scripts above illustrate some “best practices” that you should adopt:
The organization of code into self-contained scripts is itself part of a bigger picture. Rather than focusing on the variables (or plots, or even complete analyses), focus on the code that produces them. Given whatever inputs you have (none for now, although later there will be data), your code should reliably produce whatever outputs you want.
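For code that draws random numbers, "reliably" depends on setting the seed, as the plotting script does with set.seed(1234). A small sketch of why that matters (the object names first_draw and second_draw are just for illustration):

```r
## draw five values after setting the seed
set.seed(1234)
first_draw = rnorm(5)

## reset the seed and draw again
set.seed(1234)
second_draw = rnorm(5)

## the two sets of draws match exactly
identical(first_draw, second_draw)  # TRUE
```

Without the second call to set.seed(), the two draws would almost surely differ, and so would any output built from them.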
To check how “real” your scripts are, clear out your workspace using rm(list = ls()) and re-run everything. If everything works, great; if not, address the issue by editing your script. I do this frequently, and often start scripts with a line that clears the workspace.
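As a rough sketch of what that looks like, here is the plotting script from above with a workspace-clearing line at the top; the final ls() call is an addition for illustration, to confirm that only objects created by the script remain:

```r
## clear the workspace so the script starts from a blank slate
rm(list = ls())

## set seed to ensure reproducibility
set.seed(1234)

## define x and y; nothing here relies on objects left over from earlier work
x = rnorm(1000)
y = 1 + 2 * x + rnorm(1000, 0, .4)

## list what exists now
ls()
```

If the script had quietly depended on an object created interactively, it would now stop with an "object not found" error, which is exactly the kind of problem you want to find early.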
R will, by default, save everything in your workspace. I strongly suggest you turn this behavior off (Preferences > General > Save Workspace: Never). First, doing so will remove a crutch early on and help you focus on your scripts. Second, a workspace that is saved automatically, in the background, can get you in trouble if (or when) you deal with patient data.
A final word about scripts – not every line of code you write will (or should) make it to your scripts. It can take a few attempts to get code you like, and you don’t need to save the intermediate stuff. You also don’t need to save every bit of exploratory analysis you do – keep the stuff that improved your understanding, but discard the rest.
Your projects will generally consist of several related files – input data, scripts, results. It’s important to keep these organized, so you can find everything you need quickly and easily. R Projects, through RStudio, give you an easy way to do this.
Create a directory and save the previous two scripts into that directory. Script names, like variable names, should be descriptive (e.g. 20170602_data_classes.R and 20170602_simple_plots.R). The directory name should be descriptive as well, and it should be in a reasonable place on your computer (e.g. ~/Documents/First_Project/). Create an R Project using File > New Project > Existing Directory and specifying the directory you just created.
For now, the best part about R Projects is that they force you to think about and organize the elements you need for your analysis or project. Double-clicking the new .Rproj file will launch the R Project; you’ll see a Files tab in the usual Plots / Packages / Help / Viewer pane. Later, we’ll find R Projects useful in other ways.
To this point, we’ve been working entirely inside RStudio without needing to access anything on your computer. That’s allowed us to avoid a discussion of your Working Directory, but now we’ll address that too. As you’re working, R is sitting inside a single directory on your computer – it can find files in that directory or output files to that directory. Check your current working directory using:
getwd()
If you’ve opened your R Project, that should be your working directory. That’s great! Another bit of encouragement to keep your stuff organized. This means that anything you output will be written to this directory:
plot(x, y)
dev.print(pdf, "scatter_plot.pdf", height = 4, width = 4)
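To see that a bare file name really is resolved against the working directory, here is a minimal sketch. It uses a temporary folder as a stand-in for your project directory and a made-up file name (example_output.txt); both are assumptions for illustration only.

```r
## switch to a temporary folder; setwd() returns the previous directory
old_wd = setwd(tempdir())

## a bare file name is written relative to the working directory
writeLines("hello", "example_output.txt")

## check that the file sits inside the directory reported by getwd()
wrote_to_wd = file.exists(file.path(getwd(), "example_output.txt"))
wrote_to_wd

## tidy up and return to the original working directory
file.remove("example_output.txt")
setwd(old_wd)
```

Inside an R Project you would not call setwd() at all; it appears here only so the sketch does not leave files in your real project.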
With larger projects, it can be useful to create sub-directories for your project (e.g. Data and Figures) – you don’t need to worry about that for now, but when you do, use relative paths to keep everything self-contained.
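A sketch of what relative paths look like in practice, assuming a temporary folder stands in for the project directory. Note that this uses pdf() followed by dev.off() rather than dev.print(); that is a swapped-in alternative for writing a figure to a file, not the approach shown above.

```r
## stand-in project directory (an assumption; yours would be the R Project folder)
project_dir = file.path(tempdir(), "First_Project")
dir.create(project_dir, showWarnings = FALSE)
old_wd = setwd(project_dir)

## sub-directories, named as in the text
dir.create("Data", showWarnings = FALSE)
dir.create("Figures", showWarnings = FALSE)

## a relative path: it says nothing about where the project lives on disk
fig_path = file.path("Figures", "scatter_plot.pdf")

## data for the figure
set.seed(1234)
x = rnorm(1000)
y = 1 + 2 * x + rnorm(1000, 0, .4)

## open a pdf device, draw the plot, and close the device to write the file
pdf(fig_path, height = 4, width = 4)
plot(x, y)
dev.off()

## confirm the figure exists, then return to the original directory
figure_written = file.exists(fig_path)
figure_written
setwd(old_wd)
```

Because fig_path never mentions the full location of the project, the same script keeps working if the whole directory is moved or shared.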
Note that we exported a figure using a command, rather than clicking Export in the Plots tab. This is intentional – whatever your script produces, it should do so exclusively from the code you’ve written.
All this together suggests a workflow. For every new project (and I mean every new project), do the following:
(e.g. ~/Documents/School/DSI/Homework_2/)
Starting this habit early will save you a ton of headaches down the road. If you find yourself not using R, everything above except the R Project part still holds.
This content draws on the work of others; there are also useful references for a lot of this online.
The code I used to produce the working examples in lecture is here.