# Data Analysis and Visualization Using R: Lesson 1

David Robinson
1/27/14

### How to Read These Slides

In these slides, we show blocks of R code, which are immediately followed by their output:

``````print("hello world")
``````
``````[1] "hello world"
``````

The gray box shows the original R code, which you can copy and paste into your own R console to try yourself. The white box shows the code's output: you can compare it to your own results (or just trust us that that's the output).

## Numeric variables

### Assigning a variable

You store a value in a variable using the `=` operator:

``````x = 42
``````

This gives the variable `x` a value of `42`. You can show the value of `x` with:

``````print(x)
``````
``````[1] 42
``````

You can also assign a variable with `<-`: this is equivalent.

``````x <- 42
``````

### Variable names

Variable names consist of letters, digits, periods (`.`) and underscores (`_`), and cannot start with a digit or an underscore. A common convention is to use periods where you would otherwise put a space.

Legal variable names include:

• my.variable
• my_variable

Illegal names include:

• my-variable
• dave's.variable
• 2ndvariable
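
One way to check the names in these lists yourself is sketched here. It relies on the base R function `make.names`, which rewrites a string into a syntactically valid name: a string that comes back unchanged was already valid. The variable names `candidates` and `is.valid` are only for this illustration.

```r
# Candidate names taken from the two lists above.
candidates = c("my.variable", "my_variable", "my-variable",
               "dave's.variable", "2ndvariable")

# make.names() repairs invalid names, so an unchanged string was already valid.
is.valid = make.names(candidates) == candidates
names(is.valid) = candidates
print(is.valid)
```

The first two names should come back `TRUE` and the last three `FALSE`.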

### Using R like a scientific calculator

You can perform mathematical operations using `+`, `-`, `*`, and `/`:

``````x = 6 + 4
print(x)
``````
``````[1] 10
``````
``````x / 2
``````
``````[1] 5
``````
``````y = 4
x / y
``````
``````[1] 2.5
``````

### Using R like a scientific calculator

You can use exponentiation with `^`, or calculate the natural log:

``````x^2
``````
``````[1] 100
``````
``````y^3
``````
``````[1] 64
``````
``````log(x)
``````
``````[1] 2.303
``````

### Assigning variables: FAQ

• What is the difference between `<-` and `=`?
• In 99% of cases they act exactly the same, so it's a matter of personal preference. They differ only in rare cases, such as inside a function call.
• When do you need `print(x)` to display a variable, and when `x`?
• When working in the R interactive terminal, the result of each line is displayed after it is evaluated, so `print` is unnecessary. When you source a .R file, you need `print(x)` in the line or it won't display.
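
As a rough illustration of one of those rare cases, here is a minimal sketch of how `=` and `<-` behave differently inside a function call. The function `double.it` and its argument `input.number` are hypothetical names made up for this example, and the `FALSE` result assumes no variable called `input.number` already exists in your session.

```r
# Hypothetical helper, used only for this illustration.
double.it = function(input.number) input.number * 2

# Inside a call, `=` names the argument; no variable is created.
named.result = double.it(input.number = 21)
exists.after.equals = exists("input.number")   # FALSE in a fresh session

# Inside a call, `<-` assigns the variable first, then passes its value.
assigned.result = double.it(input.number <- 21)
exists.after.arrow = exists("input.number")    # TRUE

print(c(named.result, assigned.result))
print(c(exists.after.equals, exists.after.arrow))
```

Both calls return the same value; the only difference is whether `input.number` is left behind in your workspace afterwards.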

### Assigning variables: FAQ

• Why is there a `[1]` before each result?
• You'll find out in the next section!

## Vectors

You may have noticed the `[1]` at the start of each result. That's because all numbers in R are actually represented as vectors of length 1. The `[1]` gives the position of the first element shown on that row of output.

### Vectors

For example, you can use `:` to create a long vector of consecutive integers:

``````1:60
``````
`````` [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
[18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
[35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
[52] 52 53 54 55 56 57 58 59 60
``````

The `[1]`, `[18]`, and so on at the start of each row help keep track of the position within the vector.

### Creating and combining vectors

You can also create vectors yourself using `c`:

``````v1 = c(1, 2, 5, 7)
v2 = c(8, 6, 3, 2)
``````

You can also use `c` to combine existing vectors together:

``````v3 = c(v1, v2)
print(v3)
``````
``````[1] 1 2 5 7 8 6 3 2
``````

### Extracting from vectors

Use square brackets to retrieve a value from a vector, or multiple values:

``````v3
``````
``````[1] 1 2 5 7 8 6 3 2
``````
``````v3[4]
``````
``````[1] 7
``````
``````v3[4:7]
``````
``````[1] 7 8 6 3
``````
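
The positions you ask for don't have to be consecutive. As a small sketch (the names `picked` and `dropped` are just for illustration), you can pass any vector of positions inside the brackets, or use a negative position to leave an element out:

```r
v3 = c(1, 2, 5, 7, 8, 6, 3, 2)   # same values as v3 above

picked = v3[c(1, 3, 8)]   # first, third and eighth elements
dropped = v3[-1]          # everything except the first element

print(picked)
print(dropped)
```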

### Operations on vectors

Mathematical operations on a vector apply to all elements:

``````v1 = c(1, 2, 5, 7)
v1 + 2
``````
``````[1] 3 4 7 9
``````
``````v1 / 2
``````
``````[1] 0.5 1.0 2.5 3.5
``````
``````sin(v1)
``````
``````[1]  0.8415  0.9093 -0.9589  0.6570
``````

### Operations on vectors

Similarly, you can perform operations between two vectors:

``````v1
``````
``````[1] 1 2 5 7
``````
``````v2 = c(8, 6, 3, 2)
v1 + v2
``````
``````[1] 9 8 8 9
``````
``````v1 / v2
``````
``````[1] 0.1250 0.3333 1.6667 3.5000
``````

### Operations on vectors

You can also easily summarize a vector by calculating the sum, mean, or length:

``````sum(v3)
``````
``````[1] 34
``````
``````mean(v3)
``````
``````[1] 4.25
``````
``````length(v3)
``````
``````[1] 8
``````

### Character vectors

Not all values you could want to store in R are numeric. You could store:

• subject names
• gene sequences
• text for analysis

We represent these as a series of characters (letters, digits, punctuation, etc.).

### Assigning a character vector

Character vectors are surrounded by either single or double quotation marks.

``````chv = "hello"
chv2 = 'hi'
chv3 = c("hello", "world")
``````

Like numeric values, they are always vectors, though sometimes they are of length 1.
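
To see that a single string really is a vector of length 1, here is a short sketch using `length` (number of elements) and `nchar` (number of characters in each element). The names `n.single`, `n.pair` and `chars` are only for this example.

```r
chv = "hello"
chv3 = c("hello", "world")

n.single = length(chv)   # 1: one string is a vector with one element
n.pair = length(chv3)    # 2: two strings
chars = nchar(chv3)      # characters in each element, not the vector length

print(n.single)
print(n.pair)
print(chars)
```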

## Matrices

Matrices are like two-dimensional vectors, organizing values into rows and columns:

``````m = matrix(1:9, ncol=3)
m
``````
``````     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
``````

### Attributes of a matrix

You can get the number of rows, the number of columns, or both:

``````NROW(m)
``````
``````[1] 3
``````
``````NCOL(m)
``````
``````[1] 3
``````
``````dim(m)
``````
``````[1] 3 3
``````

### Retrieving a value

To extract one value from a matrix, use the structure `matrix[row, column]`.

``````m
``````
``````     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
``````
``````m[1, 3]
``````
``````[1] 7
``````

### Retrieving a row or column

Leaving the “row” spot or the “column” spot empty will extract, respectively, an entire column or an entire row.

``````m[1, ]
``````
``````[1] 1 4 7
``````
``````m[, 2]
``````
``````[1] 4 5 6
``````
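
The same bracket notation accepts several rows or columns at once, which gives back a smaller matrix. A minimal sketch, with `top.rows` and `outer.cols` as illustrative names:

```r
m = matrix(1:9, ncol=3)

top.rows = m[1:2, ]         # rows 1 and 2, all columns
outer.cols = m[, c(1, 3)]   # all rows, columns 1 and 3

print(top.rows)
print(outer.cols)
```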

### Matrix arithmetic

You can add a single value to a matrix, or multiply a matrix by a single value:

``````m + 3
``````
``````     [,1] [,2] [,3]
[1,]    4    7   10
[2,]    5    8   11
[3,]    6    9   12
``````
``````m * 2
``````
``````     [,1] [,2] [,3]
[1,]    2    8   14
[2,]    4   10   16
[3,]    6   12   18
``````

### Transpose and diagonal

Use the `t` function to transpose a matrix:

``````t(m)
``````
``````     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
``````

Use `diag` to extract the diagonal:

``````diag(m)
``````
``````[1] 1 5 9
``````

### Matrix multiplication

You can also perform traditional matrix multiplication with the `%*%` operator:

``````m2 = matrix(21:32, nrow=3)
m %*% m2
``````
``````     [,1] [,2] [,3] [,4]
[1,]  270  306  342  378
[2,]  336  381  426  471
[3,]  402  456  510  564
``````
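
It may help to check the dimensions involved: the result has as many rows as the first matrix and as many columns as the second. The sketch below also contrasts `%*%` with plain `*`, which multiplies element by element. The names `elementwise` and `matrix.product` are only for illustration.

```r
m = matrix(1:9, ncol=3)
m2 = matrix(21:32, nrow=3)

print(dim(m))          # 3 rows, 3 columns
print(dim(m2))         # 3 rows, 4 columns
print(dim(m %*% m2))   # 3 rows, 4 columns

# `*` multiplies matching elements; `%*%` is the row-by-column product.
elementwise = m * m
matrix.product = m %*% m

print(elementwise[1, 2])      # 4 * 4
print(matrix.product[1, 2])   # 1*4 + 4*5 + 7*6
```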

## Logical vectors

Another type of variable is a logical value: `TRUE` or `FALSE`. Like numbers, logical values are always stored in vectors (sometimes of length 1).

``````x = TRUE
y = c(TRUE, FALSE, TRUE)
``````

### Logical operators

Logical vectors are useful because they are the result of logical operators, such as

• `>` : greater than
• `<` : less than
• `==` : equal to
• `!=` : not equal to
• `&` : and
• `|` : or

### Logical operators: comparison

``````x = 2  # assignment
x > 0
``````
``````[1] TRUE
``````
``````x < 1
``````
``````[1] FALSE
``````
``````x != 10
``````
``````[1] TRUE
``````

### Logical operators FAQ

• Why is the logical operator for equals `==` and not `=`?
• Because `=` is already reserved for assignment.
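
A quick sketch of the two side by side (the names `is.two`, `is.three` and `matches` are just for illustration):

```r
x = 2   # one equals sign: assignment

is.two = x == 2     # two equals signs: comparison
is.three = x == 3

# Like other operators, `==` is applied to each element of a vector.
matches = c(1, 2, 3) == x

print(is.two)
print(is.three)
print(matches)
```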

## Data frames

Data frames store multiple columns of information together. Unlike a matrix, different columns in a data frame can store different kinds of information (numbers, factors, character vectors, etc.).
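
As a small sketch of what that means, here is a made-up data frame built with `data.frame`. The column names and values are invented for illustration, and `stringsAsFactors = FALSE` is set so the text column stays a character vector on older versions of R.

```r
# Invented example data: three subjects with mixed column types.
subjects = data.frame(
  name = c("Ann", "Bob", "Cat"),     # character column
  age = c(34, 27, 41),               # numeric column
  enrolled = c(TRUE, FALSE, TRUE),   # logical column
  stringsAsFactors = FALSE
)

# Report the type of each column.
column.types = sapply(subjects, class)
print(column.types)
```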

### Built-in Datasets

R comes with built-in datasets that can be retrieved by name. You can access one with the `data` function.

``````data(mtcars)
``````

`mtcars` contains statistics about 32 cars in 1974, including miles per gallon, weight, number of cylinders, etc. Each row is one car, and each column one piece of information.

### View data frame in RStudio

``````View(mtcars)
``````

See details and documentation about the data with:

``````?mtcars
``````

or

``````help(mtcars)
``````

### See first rows of data frame

One of the most useful functions is `head`, which shows the first 6 rows of a data frame (a good way to get an idea of its contents):

``````head(mtcars)
``````
``````                   mpg cyl disp  hp drat    wt  qsec
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02
Datsun 710        22.8   4  108  93 3.85 2.320 18.61
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02
Valiant           18.1   6  225 105 2.76 3.460 20.22
                  vs am gear carb
Mazda RX4          0  1    4    4
Mazda RX4 Wag      0  1    4    4
Datsun 710         1  1    4    1
Hornet 4 Drive     1  0    3    1
Hornet Sportabout  0  0    3    2
Valiant            1  0    3    1
``````

### Information about a data frame

Get the number of rows, columns or both:

``````nrow(mtcars)
``````
``````[1] 32
``````
``````ncol(mtcars)
``````
``````[1] 11
``````
``````dim(mtcars)
``````
``````[1] 32 11
``````

### Access a column by name

Use `$` to access one column by name:

``````mtcars$mpg
``````
`````` [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
[11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
[21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
``````

Each column is a vector once it is extracted.

### Access one row or value

You can use square brackets with a comma to access a single row of a data frame:

``````mtcars[1, ]
``````
``````          mpg cyl disp  hp drat   wt  qsec vs am gear
Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4
          carb
Mazda RX4    4
``````

### Access one row or value

Or you can give `row, column` to get a single value at a particular position:

``````mtcars[3, 2]
``````
``````[1] 4
``````
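
Positions can be hard to remember, so it can be clearer to use names instead. The sketch below reaches the same cell three ways; the variable names `by.position`, `by.name` and `from.column` are only for illustration.

```r
data(mtcars)

by.position = mtcars[3, 2]              # row 3, column 2
by.name = mtcars["Datsun 710", "cyl"]   # same cell, using row and column names
from.column = mtcars$cyl[3]             # take the column, then its third element

print(c(by.position, by.name, from.column))
```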

## Filtering a data frame

One common operation on data is to filter out rows based on some criterion.

### Subsetting rows of a data frame

You can get a set of rows using their indices:

``````mtcars[1:2, ]
``````
``````              mpg cyl disp  hp drat    wt  qsec vs am
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1
              gear carb
Mazda RX4        4    4
Mazda RX4 Wag    4    4
``````

However, what if you want “all automatic cars” or “all cars with mpg > 20”?

### Logical operators on a vector

Just like arithmetic operations, logical operators on a vector apply the test to each element individually:

``````v = c(1, 3, 12, 5, 2, 20)
v > 4
``````
``````[1] FALSE FALSE  TRUE  TRUE FALSE  TRUE
``````

### Compound logical operators on a vector

You can combine them using `&` (and) or `|` (or):

``````v > 4 & v < 15
``````
``````[1] FALSE FALSE  TRUE  TRUE FALSE FALSE
``````
``````v < 6 | v > 15
``````
``````[1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
``````

### Logical operations on a column

This can equally easily be applied to a column of `mtcars`:

``````mtcars$mpg > 20
``````
`````` [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
 [9]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
[25] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
``````

### Filtering a data frame logically

This logical vector can be used to subset rows of the data frame: `TRUE` means “keep the row”, `FALSE` means “drop it”. Place it before the comma in the square brackets:

``````v = mtcars$mpg > 20
efficient.cars = mtcars[v, ]
``````

or just:

``````efficient.cars = mtcars[mtcars$mpg > 20, ]
``````
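
A simple sanity check on a filter is to count how many rows it keeps. In this sketch (with `keep` as an illustrative name), `sum` works on a logical vector because each `TRUE` counts as 1:

```r
data(mtcars)

keep = mtcars$mpg > 20
efficient.cars = mtcars[keep, ]

n.true = sum(keep)               # number of TRUE values
n.rows = nrow(efficient.cars)    # number of rows that survived the filter

print(c(n.true, n.rows))
```

The two numbers should always agree.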

### Filtering on multiple conditions

You can combine multiple conditions using `&` (and) or `|` (or), such as looking for automatic gearshift cars with mpg > 20:

``````efficient.auto = mtcars[mtcars$mpg > 20 & mtcars$am == 0, ]
efficient.auto
``````
``````                mpg cyl  disp  hp drat    wt  qsec vs
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1
               am gear carb
Hornet 4 Drive  0    3    1
Merc 240D       0    4    2
Merc 230        0    4    2
Toyota Corona   0    3    1
``````
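
The `|` operator works the same way. As a sketch, this keeps cars that are either very efficient or very powerful. The thresholds 30 and 250 and the name `extremes` are arbitrary choices for illustration.

```r
data(mtcars)

# Keep a row if at least one of the two conditions is TRUE.
extremes = mtcars[mtcars$mpg > 30 | mtcars$hp > 250, ]

print(nrow(extremes))
print(rownames(extremes))
```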

## data.table

`data.table` is a third-party package that improves in many ways on the built-in `data.frame`.

We'll go over some of its advantages on Wednesday and Friday, but today we'll focus on just one: how it makes filtering more convenient.

### Turn a data.frame into a data.table

Since `data.table` is a third-party package, you need to install it first. Once it is installed, you still have to load it into R:

``````library("data.table")
``````

(You'll have to re-run that line each time you reopen R.) Then convert your data.frame to a data.table:

``````mtcars.dt = as.data.table(mtcars)
``````

### Filtering a data.table

A `data.table` looks identical in many ways to a `data.frame`, but has some useful features. One is that when you're filtering, you don't need to write `mtcars$` each time inside the brackets; you can just refer to the column names:

``````mtcars.dt[mpg > 20 & am == 0, ]
``````
``````    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
2: 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
3: 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
4: 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
``````

This doesn't mean that `mpg` and `am` exist as variables in your workspace: the column names can be used this way only within those square brackets.