A data scientist has to work with many different types of data storage and there are chances when you need to pull data from enterprise data warehouse into your analysis environment.
We can write SQL queries to retrieve data from database tables and write the data to a local CSV file. Read the local CSV file into R or Python in the analysis environment.
Alternatively, if you work with the R environment, RMarkdown supports database connection. By using RMySQL, with the connection, we extract data directly from a remote database sever. RMarkdown
is a documentation tool which enables reproducibility of data analysis, and help data scientists turn their analysis and findings into high quality documents.
The screen recording shows in R Markdown, how to execute SQL commands, use either R or SQL chunk in R Markdown, write the resulting set to data frame, and perform exploratory data analysis and visualization.
- Connect to MySQL
The connection data should not be embedded in analysis code. Separate the connection code in another script. The script should set up the connection and save it into the workspace. The saved connection is accessible by its name in the analysis code.
library(RMySQL)
jsea <- dbConnect(RMySQL::MySQL(),dbname='jsea',username='username',password='pwd', host='dbsever',port=3306)
knitr::opts_chunk$set(connection = "jsea")
- Read Data From Relational Table
After a connection has been established, dbReadTable can read a table into a data frame.
library(tidyverse)
vehicles <- as_tibble(DBI::dbReadTable(jsea,"vehicles"))
vehicles %>% slice(sample(1:10000,5))
- Run SQL in R chunk
- Run dplyr and ggplot
Once you have the table in a tibble, you can easily tidy the data and explore it by using the fantastic tidyverse
package.
frequency <- vehicles %>% count(make) %>% arrange(desc(n))
frequency$make <-
factor(frequency$make,level=frequency$make[order(frequency$n)])
frequency %>%
slice(1:30) %>%
ggplot(aes(x=make,y=n))+
geom_bar(stat="identity", fill="#87cefa") +
coord_flip() +
xlab("") +
ylab("") +
theme_light()
- Write New Data to Database Table
DBI::dbWriteTable(conn=jsea, name="iris", value=iris, row.names=FALSE, overwrite=TRUE, append=FALSE)
- Run SQL in SQL chunk
We also can run SQL in a code chunk of type sql
. By setting the connection in the code chuck and adding the option output.var
, the resulting table from the SQL is written into a variable in R. Access the report with the name report_model_by_make
.
```{sql connection=jsea, output.var="report_model_by_make"}
SELECT make, count(*) as num_of_models
FROM vehicles
GROUP BY make
ORDER BY num_of_models DESC
```
In the end, disconnect from Database with DBI::dbDisconnect().