Combine CSV files in RStudio in 5 Easy Steps
Combine CSV files in RStudio in 5 Easy Steps
Unfortunately, with the latest update to RStudio, RWordPress has not yet been updated. In the meantime, manually posting RStudio code on how to easily combine two files in R in a few easy steps:
- Step 1: Read in the first file, update the data types for several columns
- Change NA’s to 0
- Delete extra columns
- Step 2: Read in the second file
- Step 3: Merge the two files
- Delete extra columns with some reformatting
- Step 4: Add a new column to calculate number of years between two dates
- Step 5: Export file to csv and/or Excel with the today’s date
Code from RStudio:
title: “R Notebook – combine files”
output: html_document:
“`{r}
pkgs <- c(‘tidyverse’, ‘dplyr’, ‘caret’, ‘psych’ ,’readxl’, ‘scales’, ‘psych’, ‘dplyr’, ‘plyr’, ‘tidyverse’)
invisible(lapply(pkgs, require, character.only = T))
Step 1: Read in csv file
df <- read.csv(“D:/expiringnames/promo_analysis/trx_to_process/fashion_2019/fashion_trx_2019.csv”, header=TRUE, stringsAsFactors=FALSE)
view(df)
#df$Transaction <- as.Date(df$Transaction, “%Y/%m/%d %H%M”)
df <- df %>% mutate_at(vars(OrigAmt, NetAmt, Discounts, RevShare, RevShare.1, SubTotal), as.numeric)
df <- df %>% mutate_at(vars(Effective, Expiration, Billable, Transaction, InvoiceDate), as.Date, format=”%Y-%m-%d”)
tidyr::replace_na(df, list(OrigAmt=0, NetAmt=0))
#change NAs to 0
df$OrigAmt[is.na(df$OrigAmt)] <- 0
df$Discounts[is.na(df$Discounts)] <- 0
df$SubTotal[is.na(df$SubTotal)] <- 0
df$RevShare[is.na(df$RevShare)] <- 0
df$RevShare.1[is.na(df$RevShare.1)] <- 0
str(df)
#combine two columns into one
df$RevShare <- df$RevShare + df$RevShare.1
#remove extra columns
df <- subset(df, select = -c(REGREB, RWLREB, RWAREB, CVC, RevShare.1))
tail(df[,1] ) #show tail in 1st column
Step 2: Read in the file to append
df2 <- read.csv(“D:/names/analysis/trx_2019.csv”, header=TRUE, stringsAsFactors=FALSE)
view(df2)
#remove extra columns
df2 <- subset(df2, select = -c(orig_amt, discounts, NetAmt))
#check type of strings and change types as necessary
str(df2[3:25])
Step 3: Merge the two datasets – horizontally/columns
df3 <- cbind(df, df2)
str(df3)
#eliminate columns not needed
df4 <- subset(df3, select = -c(Invoice, Billable, InvoiceDate, possible_sp_error, WordsPresent, WordsNotPresent, Status))
view(df4)
#reformat percentage values
library (‘scales’)
scales::percent(df4$Percentage/100)
dim(df4)
–> result [1] 8612 35 (reflecting the addition of the columns as originally started w/ 25)
Step 4: Calculate years
#To determine the length of the service period – subtract two dates & put into years
df4$term <- as.numeric(difftime(df4$Expiration, df4$Effective), unit=”weeks”)/52.25
df4$term
library (lubridate, plyr)
int <- interval(df4$Effective, df4$Expiration)
time_length (int, “year”) #exact
df4$term <- trunc(time_length(int, “year”)) #truncated version to remove decimal places due to small portions i.e. leap years
df4$term
Step 5: Export file with current date
date <- Sys.Date()
#export as .csv
csvfilename <- paste(“D:/df4_file_”, date,”.csv”,sep=””)
write.csv(df4, file=csvfilename, row.names=FALSE) #row.names being false avoids extra column ref
#export as a excel
library (openxlsx)
xlfilename <- paste(“D:/df4_file_”, date,”.xlsx”)
write.xlsx (df4, file=xlfilename, row.names=FALSE)
Want to combine csv files in Python instead? See
date <- Sys.Date()
#export as .csv
csvfilename <- paste(“D:/df4_file_”, date,”.csv”,sep=””)
write.csv(df4, file=csvfilename, row.names=FALSE) #row.names being false avoids extra column ref
#export as a excel
library (openxlsx)
xlfilename <- paste(“D:/df4_file_”, date,”.xlsx”)
write.xlsx (df4, file=xlfilename, row.names=FALSE)
Want to combine csv files in Python instead? See