The data.frame object in R groups a number of column vectors into a single data set. It organizes data much like a spreadsheet, as a two-dimensional frame of rows and columns. A tibble is a modern variant of the classical data.frame that is used by some R packages. A data.frame can only hold named columns of the same length.
data.frame is included in base R. A similar data structure is available in Python through the pandas library.
A data.frame looks like an Excel spreadsheet on the surface, with columns and rows. In statistical terms, each column is a variable and each row is an observation. In data mining terms, each column is an attribute and each row is an instance. In machine learning terms, each column is a feature and each row is an object (also called a sample).
Creating data.frame
A data.frame object is created by passing a set of named vectors to the data.frame function. For example, the following Chicago temperature forecasts for the next five days can be stored in a data.frame:
Day | Date | TempF |
---|---|---|
Thursday | Feb 1 | 26 |
Friday | Feb 2 | 22 |
Saturday | Feb 3 | 30 |
Sunday | Feb 4 | 32 |
Monday | Feb 5 | 24 |
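The table above can be built with a single call to data.frame, passing one named vector per column. This is a minimal sketch; the variable name chicWeather is the one used by the later examples in this post.

```r
# Build the forecast table column by column; each argument is a named vector.
chicWeather <- data.frame(
  Day   = c("Thursday", "Friday", "Saturday", "Sunday", "Monday"),
  Date  = c("Feb 1", "Feb 2", "Feb 3", "Feb 4", "Feb 5"),
  TempF = c(26, 22, 30, 32, 24)
)
print(chicWeather)
```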
Alternatively, create the named columns first and then combine them:
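A sketch of this column-first approach, using the forecast values from the table: when plain variables are passed to data.frame, their names become the column names.

```r
# Create each column as a standalone vector first.
Day   <- c("Thursday", "Friday", "Saturday", "Sunday", "Monday")
Date  <- c("Feb 1", "Feb 2", "Feb 3", "Feb 4", "Feb 5")
TempF <- c(26, 22, 30, 32, 24)

# data.frame() takes the column names from the variable names.
chicWeather <- data.frame(Day, Date, TempF)
print(chicWeather)
```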
Cautions
Nonmatching Vector Lengths
If you attempt to create a data frame from vectors with nonmatching lengths, R stops with an error. For example, calling data.frame(x = 1:5, y = 1:2) produces the following message:
Error in data.frame(x = 1:5, y = 1:2) : arguments imply differing number of rows: 5, 2
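To see the failure without stopping a script, the call can be wrapped in tryCatch. The sketch below also shows a related case worth knowing: a shorter vector whose length divides the longer one evenly is recycled rather than rejected. The variable names msg and recycled are illustrative only.

```r
# Capture the error message instead of halting execution.
msg <- tryCatch(
  data.frame(x = 1:5, y = 1:2),
  error = function(e) conditionMessage(e)
)
print(msg)

# Lengths 4 and 2: y is recycled a whole number of times, so no error occurs.
recycled <- data.frame(x = 1:4, y = 1:2)
print(recycled)
```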
Encode String Input to Factor
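Before R 4.0.0, data.frame converted character vectors to factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE, so strings stay as character vectors unless the conversion is requested. The str command displays the type of each column in chicWeather. The output below comes from a session in which the conversion to factor is active: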
str(chicWeather)
## 'data.frame': 5 obs. of 3 variables:
## $ Day : Factor w/ 5 levels "Friday","Monday",..: 5 1 3 4 2
## $ Date : Factor w/ 5 levels "Feb 1","Feb 2",..: 1 2 3 4 5
## $ TempF: num 26 22 30 32 24
Both Day and Date are of the factor type. To prevent this conversion from character to factor, add stringsAsFactors = FALSE to the data.frame call:
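A minimal sketch of the call with the conversion switched off explicitly, using the same forecast values as before; str then reports Day and Date as chr columns.

```r
# Keep string columns as character vectors regardless of the R version.
chicWeather <- data.frame(
  Day   = c("Thursday", "Friday", "Saturday", "Sunday", "Monday"),
  Date  = c("Feb 1", "Feb 2", "Feb 3", "Feb 4", "Feb 5"),
  TempF = c(26, 22, 30, 32, 24),
  stringsAsFactors = FALSE
)
str(chicWeather)
```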
EX1
Write R code to create the following data frame of 5 rows and 5 named columns, store it in a variable named weather, and print the data.
outlook | temperature | humidity | windy | play |
---|---|---|---|---|
sunny | 85 | 85 | FALSE | no |
sunny | 80 | 90 | TRUE | no |
overcast | 83 | 86 | FALSE | yes |
rainy | 70 | 96 | FALSE | yes |
rainy | 68 | 80 | FALSE | yes |
Inspecting the Data
It is useful to know a few base R functions that help you quickly understand the data stored in a data frame, including each column's data type and sample values. Run the following commands on a data.frame.
class(weather) # class of the object
nrow(weather) # row count in the data
ncol(weather) # column count in the data
colnames(weather) # column names
rownames(weather) # row names
dim(weather) # dimension of the data
dim(weather)[2] # column count in the data
str(weather) # view the structure of the data
head(weather) # return the first few rows of the data
tail(weather) # return the last few rows of the data
summary(weather) # summary statistics of each column
The View of Data.frame
Because R views a data.frame as simply a named list of column vectors, each element of a data frame is a column vector. Therefore,
- The length function returns the number of column vectors.
- The names function returns the element (column) names.
The following example shows the outputs of the two functions:
length(weather)
## [1] 5
names(weather)
## [1] "outlook" "temperature" "humidity" "windy" "play"
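Because a data frame is a list underneath, list tools can be applied to it column by column. The sketch below uses a small hypothetical data frame named demo (not part of the examples above) to show that is.list is true for a data frame and that sapply visits each column in turn.

```r
# A small illustrative data frame (hypothetical values).
demo <- data.frame(id = 1:3, score = c(9.5, 7.0, 8.25))

print(is.list(demo))   # a data frame is also a list
print(length(demo))    # number of columns

# Apply class() to every column, as one would to the elements of a list.
col_types <- sapply(demo, class)
print(col_types)
```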
EX2
R contains many built-in datasets in the base package datasets. Check whether the package datasets is available in the library. You should have it, as it comes with the R installation; if it is missing, the R installation itself is likely incomplete. To display a built-in dataset, e.g., iris, simply type the name of the dataset and run it.
iris
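One possible way to check for the package from the console is requireNamespace, which returns TRUE when the package can be loaded. This is only a sketch; the variable name has_datasets is illustrative.

```r
# TRUE if the datasets package is installed and loadable.
has_datasets <- requireNamespace("datasets", quietly = TRUE)
print(has_datasets)

# Show only the first rows rather than printing the whole dataset.
head(datasets::iris)
```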
Open the help doc for a built-in dataset
Each built-in R dataset comes with a help document explaining the values inside. To open the help document, run the help command.
help(iris)
Inspect the object
Run the commands from the Inspecting the Data section on iris.
Find data type of a column (attribute)
To find the type of a column, e.g., Species in iris, run the command:
class(iris$Species)
The result shows that the column is of type factor. A factor stores categories and enumerated values.
Write the code to find the type of the column `Sepal.Length` in `iris`.
Selecting one column
As with a list, we can reference a single element (column vector) of a data frame using any of the following styles:
- Double square brackets
- The $ sign with column name
- Subscripting as in a numerical matrix, with square brackets
chicWeather[[3]] # double square brackets
## [1] 26 22 30 32 24
chicWeather$TempF # the $ symbol
## [1] 26 22 30 32 24
chicWeather[,3] # subscripting
## [1] 26 22 30 32 24
chicWeather[,"TempF"]
## [1] 26 22 30 32 24
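All of the forms above return a plain vector. As a hedged aside, the sketch below contrasts them with single square brackets and with drop = FALSE, both of which keep the result as a one-column data frame. The variable names col_vec, col_df, and col_keep are illustrative.

```r
chicWeather <- data.frame(
  Day   = c("Thursday", "Friday", "Saturday", "Sunday", "Monday"),
  Date  = c("Feb 1", "Feb 2", "Feb 3", "Feb 4", "Feb 5"),
  TempF = c(26, 22, 30, 32, 24)
)

col_vec  <- chicWeather[["TempF"]]                # numeric vector
col_df   <- chicWeather["TempF"]                  # one-column data frame
col_keep <- chicWeather[, "TempF", drop = FALSE]  # also a data frame

print(class(col_vec))
print(class(col_df))
print(class(col_keep))
```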
Subscripting data.frame like Matrix
R allows us to reference the data frame as if it were a matrix. We can select rows and columns of a data.frame with the same subscripting methods used for mathematical matrices. For example:
chicWeather[,3]
chicWeather[1:3,]
chicWeather[1:2,c(1,3)]
chicWeather[1:2,c("Day","TempF")]
Logical subscripting a single column
A logical expression can also filter the values in a single column, returning only those for which the criterion evaluates to TRUE. The following expression returns only the values in TempF that are higher than 25.
chicWeather$TempF[ chicWeather$TempF > 25 ] # logical subscript
## [1] 26 30 32
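It can help to look at the intermediate logical vector that the comparison produces, since that vector is what actually does the selecting. A minimal sketch using the TempF values on their own (the name above25 is illustrative):

```r
TempF <- c(26, 22, 30, 32, 24)

# The comparison is evaluated element by element.
above25 <- TempF > 25
print(above25)

# Only the positions holding TRUE are kept.
print(TempF[above25])

# which() reports the positions that passed the test.
print(which(above25))
```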
Logical subscripting for filtering rows
A logical expression placed in the row position is tested row by row, and only the rows that satisfy the criterion are kept. The following code returns a subset containing only the rows with temperatures higher than 25°F.
chicWeather[ chicWeather$TempF > 25, ]
## Day Date TempF
## 1 Thursday Feb 1 26
## 3 Saturday Feb 3 30
## 4 Sunday Feb 4 32
The following code selects the days when the temperature is exactly 22°F.
chicWeather[ chicWeather$TempF == 22, ]
## Day Date TempF
## 2 Friday Feb 2 22
Logical subscripting (subsetting) both rows and columns
If the columns need to be filtered too, a vector of column names or indexes is given as the second subscript.
chicWeather[ chicWeather$TempF > 25, c("Day", "TempF") ]
## Day TempF
## 1 Thursday 26
## 3 Saturday 30
## 4 Sunday 32
EX3
Retrieve a column from iris
As the columns in iris are named, the $ symbol provides an intuitive way of referencing named columns. For example, to retrieve the column named Species, run the following expression; it returns a vector containing the requested column.
iris$Species
Write the code to retrieve the column Petal.Length in the iris dataset.
Run the following code. Describe the values in set1 and set2.
df <- data.frame(X = -2:2, Y = 1:5)
set1 <- df$Y[ df$X > 0 ]
set2 <- df[ df$X > 0, ]
Subsetting iris data
Write R code to retrieve the following subsets from iris:
- The first 50 rows
- The first 2 columns
- The columns Sepal.Length and Petal.Length
- All of the columns excluding the last column, Species
- The rows whose Species equals 'setosa'
- The rows whose Species is not 'setosa'
Adding New Columns to data.frame
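A new column is added by assigning a vector of matching length to a new column name with the $ symbol. For example, add a new column named TempC to chicWeather, containing the temperature in degrees Celsius: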
TempC <- round((chicWeather$TempF - 32) * 5/9)
chicWeather$TempC <- TempC
print(chicWeather)
## Day Date TempF TempC
## 1 Thursday Feb 1 26 -3
## 2 Friday Feb 2 22 -6
## 3 Saturday Feb 3 30 -1
## 4 Sunday Feb 4 32 0
## 5 Monday Feb 5 24 -4
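The same conversion can be wrapped in a small helper and assigned with double square brackets instead of $. This is only a sketch: the helper f_to_c and the data frame forecast are hypothetical names introduced here for illustration.

```r
# Hypothetical helper: Fahrenheit to Celsius, rounded to whole degrees.
f_to_c <- function(temp_f) round((temp_f - 32) * 5 / 9)

forecast <- data.frame(TempF = c(26, 22, 30, 32, 24))

# Assigning to a column name that does not exist yet creates the column.
forecast[["TempC"]] <- f_to_c(forecast$TempF)
print(forecast)
```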
Alternatively, pass the existing data frame and the new vector to the data.frame command:
Humidity <- c(2,1,8,5,4)
chicWeather <- data.frame(chicWeather, Humidity)
print(chicWeather)
## Day Date TempF TempC Humidity
## 1 Thursday Feb 1 26 -3 2
## 2 Friday Feb 2 22 -6 1
## 3 Saturday Feb 3 30 -1 8
## 4 Sunday Feb 4 32 0 5
## 5 Monday Feb 5 24 -4 4
EX4
Load the built-in mtcars dataset. Read the help doc of mtcars to understand the origin of the data. Use mtcars to:
- Print only the first five rows.
- Print the last five rows.
- How many rows and columns does the data have?
- Look at the data in the RStudio data viewer (if you are using RStudio).
- Print the mpg column of the data.
- Print the mpg column of the data where the corresponding cyl column is 6.
- Print all rows of the data where cyl is 6.
- Print all rows of the data where mpg is greater than 25, but only for the mpg and cyl columns.
EX5
Use the diamonds dataset from the ggplot2 package to complete the following tasks:
- Install the ggplot2 package. ggplot2 contains the diamonds dataset.
- Import the ggplot2 package. Load the diamonds data.
- Run the command ?diamonds. The help page will open under the Help tab. Read the document to understand the origin of the data and its attributes.
- Print the first five rows.
- Print the row count and column count.
- Select the rows whose cut equals Very Good, and find the number of rows in the returned subset.
- Find how many diamonds have a carat greater than 3.0.
- Return the rows where color is D, but only for the color and cut columns.
- Run the summary command on the diamonds data. Read off the average (mean) price.