Data Manipulation: SQL Vs R

October 2, 2016
by kunalgera
0
5745
- Tags:
- Air Quality
- Cran
- data frame
- R libraries
- R packages
- SQL in R
- SQL library
- transformation

Kunal Gera - Certified in Business and Data Analytics By: Kunal Gera

Introduction using SQL in R

The two part of it,

Using data frame as table (as database table)
- Using base function (other library ‘dplyr’)
- Using ‘sqldf’
Connecting with actual database

I will be explaining the former one.

Using data frame as table

SQL data manipulation task

Select with ‘where’, ‘group by’ and ‘order by’ clause
Count with ‘where’ clause
Join
Update
Addition of column
Applying function

A.Using base function

[I will be giving code snippets of the above mentioned function in ‘base’]

Increasing efficiency of base code using ‘data.table’ library. This library is also

There is other library in R for DML, it is ‘dplyr’. [I will not be explaining it]

B.Introduction to SQL Library in R (‘sqldf’)

[I will be giving code snippets of the above mentioned function in ‘sqldf’]

SQL in R

Data! Data! Data, here, there and now everywhere. We are generating data from most of the activity we perform these days. From buying grocery to buying a car, from sleeping with an alarm to gyming.

To handle this amount of data we need tools to manipulate it efficiently. As we can’t consume raw food similarly we can’t consume raw data. We need to cook it, and for that we have a great tools.

This article is specifically very useful for learners who know SQL and want to explore R.

Prerequisite:

Skills

Knowledge of SQL
Basic knowledge of R and data frames (You can think ‘data frames’ as ‘table’ in a database).

Software

Any R IDE.
Any Database (For connecting to real database)

Introduction to SQL in R

SQL is a very powerful language for talking to any relational database. This power can be multiplied when clubbed with R. R is a statistical software which is highly used in data mining and analytics industry. As it is an open source software, it turns to be a great choice for many.

R is been used in various task ranging from analytical model building, visualising the data to even report generation. To start in R we need to learn few basic of data manipulation, as a SQL user you must have performed many query based manipulation. Let’s now learn how we can do the same in R.

Today, we will be focusing on how we can make R talk and understand SQL and getting the same result as we get from SQL in database.

We will be learn most used data manipulation techniques, for that we will cover following topics:

S. No.	SQL	R
A	Where clause	Subset
B	Group by clause	Aggregate
C	Order by	Sort & Aggrange
D	Transforming a column	Mutate & tranformation
E	Distinct	Duplicated
F	Join	Merge

Making R behave and understand SQL

We will be applying function and queries on dataframes in R

Base & ‘dplyr’ Function (Using R functions to give output similar to SQL)
Sqldf package (Using SQL queries on dataframe)

Making R speak SQL

Connect R with actual database

Let’s start cooking

For cooking the data we will need two things:

A. R Libraries (spice)

Base, ‘dplyr’ and Sqldf’ (SQLite queries)

Base is available
library(dplyr)
library(sqldf)

We will be learning the functions in each library to get the same output.

B. Data (raw food)

For today’s learning we will be using ‘airquality’ dataset in R, which we can load the dataset by following command:

Loading the dataset

data(airquality)

Let’s now understand (seeing the datatype, base statistics and first few rows) about the dataset by using the important commands in R

Understanding the dataset

str() can used to know the datatype and few observation of any dataframe.

str(airquality) str-airquality Similarly, summary() can used to know the statistical details about the observation of any dataframe. Eg: Mean, Median, Quartiles and frequency and etc.

summary(airquality) summary-airquality head() is an easy and quick way to look few row of any dataframe.

head(airquality) head-airquality dim() is a function to know the dimension any dataframe i.e. count of rows and column.

dim(airquality)
[1] 153 6

Quick Tip: You can write ‘?functionname‘ in R console to know more about that function. E.g. ‘dim‘ will show the documentation on ‘dim()’ function.

So, airquality is data of daily air quality measurements in New York, from May to September 1973.

Ozone numeric Ozone (ppb)

Solar.R numeric Solar R (lang)

Wind numeric Wind (mph)

Temp numeric Temperature (degrees F)

Month numeric Month (1–12)

Day numeric Day of month (1–31)

Hope you guys have understood the ‘airquality’ dataset, it not that tough though.

So now, we know our data set so we can start cooking our dataset with the above mentioned technique and functions.

A. Subset

It is simple to understand and most useful technique in data manipulation, i.e. subset: which is filtering the data set based of some condition or clause.

Base

“Subset in BASE“: subset(airquality, Month == 5)

airquality[airquality$Month == 5,]

Selecting specific columns: So, if we want Ozone,Solar.R, Month and Temp from airqualit then following command can be used:

airquality[,c(“Ozone”,”Solar.R”,“Month”,”Temp”)]

dplyr

“filter in dplyr”: filter(airquality,Month == 5)

“Selecting specific columns”:

select(airquality,Ozone,Solar.R,Month,Temp)

Note: We can use pipes(%>%) to make it one command

airquality %>% filter(Month==5) %>% select(Ozone,Solar.R,Month,Temp)

Pipe consider the left side dataframe for right side function which is similar to SQL.

sqldf

“Where clause in sqldf”: sqldf(‘Select * from airquality where Month = 5; ‘)

“Selecting specific columns “: sqldf(‘Select Ozone,Solar.R,Month,Temp from airquality where Month = 5; ‘) number-of-evaluations
For subset in R, “base” is preferred as it is laconic and time efficient.

B. Aggregate

It is highly used technique in data manipulation, summarising or aggregating (i.e. sum, mean, standard deviation etc.) the data on a grouping variable.

Here, we will find the sum of Temp and average of Wind grouped by Month.

Base

aggregate(airquality[,c(TotalTemp = “Temp”,”Wind”)], list(“Month” = Month),mean)

aggregate(cbind(“AverageTemp” = airquality$Temp, “AverageWind” = airquality$Wind) ~ Month, data = airquality, FUN=mean)

Note: We can’t apply two or more aggregate functions on different columns, but we can apply one function to all the columns. So we have to re-apply the function accordingly.

dplyr

airquality %>% group_by(Month) %>%

summarise(“TotalTemp” = sum(Temp), “AverageWind” =

mean(Wind,2))

sqldf

sqldf(“select Month, sum(Temp) as TotalTemp, avg(Wind) as AverageWind from airquality group by Month;”)
For aggregation, the choice is dplyr as it allows use of multiple aggregate function simultaneously, ease to understand and time-efficient.

C. Sort

Arranging or sort is a normal task used on various occasion in data manipulation, here we are sorting the data by ascending order of Month and descending order of Day.

Base

airquality_sorted_base <- airquality[order(airquality$Month,-airquality$Day),]

dplyr

airquality %>% arrange(Month,desc(Day))

sqldf

sqldf(“select * from airquality order by Month,Day desc;”) sorting-data
Usually the sorting data is combine with other data manipulation techniques, so “dplyr” become the first choice because it’s laconic and it can be clubbed using pipes.

D. Formulate a new column (transforming columns)

In manipulation of data, this may be an important task we need to perform sometimes. Assuming, we want to find out the Temp as percentage of total temperature of all months.

Base

airquality$TempPercentage <- airquality$Temp/(sum(airquality$Temp))

dplyr

airquality <- mutate(airquality, TempPercentage = Temp/sum(Temp))

sqldf

sqldf(“select aq.*,cast(Temp as real)/(select sum(Temp) from airquality) as TempPercentage from airquality aq;”) transformation-data
For transformation, “dpylr” is preferred as you can transform multiple column in single line of code but if you have only one or two columns you can use “base”.

E. Removing Duplicate (Single & multiple columns)

For removing duplicates you need a dataset with duplicate observation, for this we can use rbind() function to duplicate the current dataset by following command.

airquality_duplicated <- rbind(airquality,airquality)

Base

airquality_duplicated[!duplicated(airquality_duplicated[,c(“Month”,”Day”)]),]

dplyr

distinct_aq_dplyr <- airquality_duplicated %>% distinct(Month,Day)

sqldf

sqldf(‘select * from airquality_duplicated

where rowid in

(select min(rowid) from airquality_duplicated group by Month, Day)

;’) duplicate-removal
For removing duplicate, clearing “dpylr” is the choice because it’s laconic and easy to understand.

F. Join

Now, assuming we want to add “Month” name in the dataset, we can do this by various method, one of which is using join or merging technique. For this, just create a file named “monthnames.csv” with the below values and import the file in R environment as ‘monthnames’ (dataframe)

“1 Jan”
“2 Feb”
“3 Mar”
“4 Apr”
“5 May”
“6 Jun”
“7 Jul”
“8 Aug”
“9 Sep”
“10 Oct”
“11 Nov”
“12 Dec”

To import the file, save the above file in your working directory and run the following command

monthnames <- read.csv(file = “monthnames.csv”, header=FALSE)

names(monthnames) = c(“Month”,”MonthName”)

Base

airquality_join_base <- merge(x = airquality, y = monthnames, by.x = “Month”, by.y = “Month”, x.all=TRUE)

Similarly,

Outer join: merge(x = airquality, y = monthnames, by.x = “Month”, by.y = “Month”, all = TRUE)

Left outer: merge(x = airquality, y = monthnames, by.x = “Month”, by.y = “Month”, all.x = TRUE)

Right outer: merge(x = airquality, y = monthnames, by.x = “Month”, by.y = “Month”, all.y = TRUE)

dplyr

airquality_join_dplyr <- inner_join(x = airquality, y = monthnames, by = c(“Month” = “Month”))

Similarly,

Outer join: full_join(x = airquality, y = monthnames, by = c(“Month” = “Month”))

Left outer: left_join(x = airquality, y = monthnames, by = c(“Month” = “Month”))

Right outer: right_join(x = airquality, y = monthnames, by = c(“Month” = “Month”))

sqldf

airquality_sqldf <- sqldf(‘select aq.*,mn.MonthName from airquality aq

inner join monthnames mn

on aq.Month = mn.Month; ‘)

Similarly,

Left join: sqldf(‘select aq.*, mn.MonthName from airquality aq

left join monthnames mn

on aq.Month = mn.Month;’)

Note: RIGHT and FULL OUTER JOINs are not currently supported in sqldf joins-data
For joins, “dplyr” is the clearly the best as it’s laconic, time-efficient and easy to understand.

Conclusion

S. No	Topic	Utility	BASE	dplyr	sqldf
A	Subset	Very frequent	1	2	3
B	Aggregate	Frequent	2	1	3
C	Sort	Very frequent	1	2	3
D	Transform	Less frequent	1	2	3
E	Duplicate	Very frequent	1	1	3
F	Join	Frequent	2	1	3
Total (Least the better)			8	9	18

*Least the better

There is a minute difference in time for ‘base’ and ‘dplyr’, though ‘base’ is a clear winner but I suggest to choice a ‘dplyr’ package based on following reasons:

Substantially efficient on big size dataset
Capable of both in and out memory operation (i.e. dataframe and connect with database).
Laconic and ease to understand.

Connect R with actual database using ‘RODBC’

Step a) Create DSN Name for your database.

I will not be concentrating on this a lot on this, but at macros level, I will explain that it requires ODBC driver specific to database. You can google to find “How to setup ODBC connection”.

You can follow these articles to setup DSN name:

Step b) Connecting Database through R.

To connect any database through R we need a library called ‘RODBC’

library(RODBC)

i) Creating a connection with database, for that we will use odbcConnect() function in RODBC library.

db <- odbcConnect(dsn=”TEST “, uid=”USERNAME”, pwd=”PASSWORD”, believeNRows=FALSE)

Note: In oracle the username and password will be specific to the schema you want to connect.

ii) Importing data using SQL query through database connection, we will use sqlQuery() function in RODBC library.

DataFrame = sqlQuery(db, “Select * from TABLE”)

This have create a ‘dataframe’ with all observation in TABLE. You can now use the connection to perform any SQL query and import any table.

It is very powerful as the queries are performed by SQL in database and not R, but remember the data is getting transferred, between R and Database, every time you execute the query.

You can learn more on the above topics:

http://cran.r-project.org/web/packages/dplyr/

http://cran.r-project.org/web/packages/sqldf/

http://cran.r-project.org/web/packages/RODBC/

Author: Kunal Gera

Profile: Kunal has a strong background of finance and banking acumen.

He has been involved in requirement gathering and solving problem with the help of data analysis. He is well versed with various analytical concepts and have been using techniques like Logistic Regression, Tree, CART Model, Linear optimization, Integer optimization etc in his career.