noctua
would return a data.frame
for utility SQL
queries regardless of backend file parser. This is due to AWS Athena
outputting SQL UTILITY
queries as a text file that required to be read in line by line. Now noctua
will return the correct data format based on file parser set in noctua_options
for example: noctua_options("vroom")
will return tibbles
.dbClearResult
when user doesn’t have permission to delete AWS S3 objects (#96)noctua_options
contains 2 new parameters to control how noctua
handles retries.dbFetch
is able to return data from AWS Athena in chunk. This has been achieved by passing NextToken
to AthenaResult
s4 class. This method won’t be as fast n = -1
as each chunk will have to be process into data frame format.library(DBI)
con <- dbConnect(noctua::athena())
res <- dbExecute(con, "select * from some_big_table limit 10000")
dbFetch(res, 5000)
dbWriteTable
opts to use alter table
instead of standard msck repair table
. This is to improve performance when appending to tables with high number of existing partitions.dbWriteTable
now allows json to be appended to json ddls created with the Openx-JsonSerDe library.dbConvertTable
brings dplyr::compute
functionality to base package, allowing noctua
to use the power of AWS Athena to convert tables and queries to more efficient file formats in AWS S3 (RAthena: # 37).dplyr::compute
to give same functionality of dbConvertTable
region_name
check before making a connection to AWS Athena (RAthena: # 110)dbWriteTable
would throw throttling error
every now and again, retry_api_call
as been built to handle the parsing of data between R and AWS S3.dbWriteTable
did not clear down all metadata when uploading to AWS Athena
dbWriteTable
added support ddl structures for user who have created ddl’s outside of noctua
noctua
retry functionality\dontrun
(#91)pyathena
, noctua_options
now has a new paramter cache_size
. This implements local caching in R environments instead of using AWS list_query_executions
. This is down to dbClearResult
clearing S3’s Athena output when caching isn’t disablednoctua_options
now has clear_cache
parameter to clear down all cached data.dbRemoveTable
now utilise AWS Glue
to remove tables from AWS Glue
catalog. This has a performance enhancement:library(DBI)
con = dbConnect(noctua::athena())
# upload iris dataframe for removal test
dbWriteTable(con, "iris2", iris)
# Athena method
system.time(dbRemoveTable(con, "iris2", confirm = T))
# user system elapsed
# 0.247 0.091 2.243
# upload iris dataframe for removal test
dbWriteTable(con, "iris2", iris)
# Glue method
system.time(dbRemoveTable(con, "iris2", confirm = T))
# user system elapsed
# 0.110 0.045 1.094
dbWriteTable
now supports uploading json lines (http://jsonlines.org/) format up to AWS Athena
(#88).library(DBI)
con = dbConnect(RAthena::athena())
dbWriteTable(con, "iris2", iris, file.type = "json")
dbGetQuery(con, "select * from iris2")
dbConnect
didn’t correct pass .internal
metadata for paws objects.computeHostName
& computeDisplayName
now get region name from info
object from dbConnect
S4 class.dbWriteTable
appending to existing table compress file type was incorrectly return.Rstudio connection tab
comes into an issue when Glue Table isn’t stored correctly (RAthena: # 92)AWS_REGION
into dbConnect
writeBin
: Only 2^31 - 1 bytes can be written in a single call (and that is the maximum capacity of a raw vector on 32-bit platforms). This means that it will error out with large raw connections. To over come this writeBin
can be called in chunks. If readr
is available on system then readr::write_file
is used for extra speed.library(readr)
library(microbenchmark)
# creating some dummy data for testing
X <- 1e8
df <-
data.frame(
w = runif(X),
x = 1:X,
y = sample(letters, X, replace = T),
z = sample(c(TRUE, FALSE), X, replace = T))
write_csv(df, "test.csv")
# read in text file into raw format
obj <- readBin("test.csv", what = "raw", n = file.size("test.csv"))
format(object.size(obj), units = "auto")
# 3.3 Gb
# writeBin in a loop
write_bin <- function(
value,
filename,
chunk_size = 2L ^ 20L) {
total_size <- length(value)
split_vec <- seq(1, total_size, chunk_size)
con <- file(filename, "a+b")
on.exit(close(con))
sapply(split_vec, function(x){writeBin(value[x:min(total_size,(x+chunk_size-1))],con)})
invisible(TRUE)
}
microbenchmark(writeBin_loop = write_bin(obj, tempfile()),
readr = write_file(obj, tempfile()),
times = 5)
# Unit: seconds
# expr min lq mean median uq max neval
# R_loop 41.463273 41.62077 42.265778 41.908908 42.022042 44.313893 5
# readr 2.291571 2.40495 2.496871 2.542544 2.558367 2.686921 5
sql_translate_env
(RAthena: # 44)# Before
translate_sql("2019-01-01", con = con)
# '2019-01-01'
# Now
translate_sql("2019-01-01", con = con)
# DATE '2019-01-01'
fwrite
(>=1.12.4) https://github.com/Rdatatable/data.table/blob/master/NEWS.mdpaste
/paste0
would use default dplyr:sql-translate-env
(concat_ws
). paste0
now uses Presto’s concat
function and paste
now uses pipes to get extra flexibility for custom separating values.append
set to TRUE
then existing s3.location will be utilised (RAthena: # 73)db_compute
returned table name, however when a user wished to write table to another location (RAthena: # 74). An error would be raised: Error: SYNTAX_ERROR: line 2:6: Table awsdatacatalog.default.temp.iris does not exist
This has now been fixed with db_compute returning dbplyr::in_schema
.library(DBI)
library(dplyr)
con <- dbConnect(RAthena::athena())
tbl(con, "iris") %>%
compute(name = "temp.iris")
dbListFields
didn’t display partitioned columns. This has now been fixed with the call to AWS Glue being altered to include more metadata allowing for column names and partitions to be returned.dbListFields
dbStatistics
is a wrapper around paws
get_query_execution
to return statistics for noctua::dbSendQuery
resultsdbGetQuery
has new parameter statistics
to print out dbStatistics
before returning Athena results.noctua_options
vroom
has been restricted to >= 1.2.0 due to integer64 support and changes to vroom
apidplyr::tbl
when calling Athena when using the ident method (#64):library(DBI)
library(dplyr)
con <- dbConnect(noctua::athena())
# ident method:
t1 <- system.time(tbl(con, "iris"))
# sub query method:
t2 <- system.time(tbl(con, sql("select * from iris")))
# ident method
# user system elapsed
# 0.082 0.012 0.288
# sub query method
# user system elapsed
# 0.993 0.138 3.660
dplyr
sql_translate_env: expected results have now been updated to take into account bug fix with date fieldsdata.table
to vroom
. From now on it is possible to change file parser using noctua_options
for example:dbGetTables
that returns Athena hierarchy as a data.framevroom
roxygen2
7.0.2dbWriteTable
append
parameter checks and uses existing AWS Athena DDL file type. If file.type
doesn’t match Athena DDL file type then user will receive a warning message:warning('Appended `file.type` is not compatible with the existing Athena DDL file type and has been converted to "', File.Type,'".', call. = FALSE)
INTEGER
being incorrectly translated in sql_translate_env.R
as.character
was getting wrongly translated (RAthena: # 45)data-transfer
dbRemoveTable
new parameters are added in unit testsql_translate_env
until test to cater bug fixdbRemoveTable
can now remove S3 files for AWS Athena table being removed.tolower
conversion due to request (RAthena: # 41)dbWriteTable
now will split gzip
compressed files to improve AWS Athena performance. By default gzip
compressed files will be split into 20.Performance results
library(DBI)
X <- 1e8
df <- data.frame(w =runif(X),
x = 1:X,
y = sample(letters, X, replace = T),
z = sample(c(TRUE, FALSE), X, replace = T))
con <- dbConnect(noctua::athena())
# upload dataframe with different splits
dbWriteTable(con, "test_split1", df, compress = T, max.batch = nrow(df), overwrite = T) # no splits
dbWriteTable(con, "test_split2", df, compress = T, max.batch = 0.05 * nrow(df), overwrite = T) # 20 splits
dbWriteTable(con, "test_split3", df, compress = T, max.batch = 0.1 * nrow(df), overwrite = T) # 10 splits
AWS Athena performance results from AWS console (query executed: select count(*) from ....
):
library(DBI)
X <- 1e8
df <- data.frame(w =runif(X),
x = 1:X,
y = sample(letters, X, replace = T),
z = sample(c(TRUE, FALSE), X, replace = T))
con <- dbConnect(noctua::athena())
dbWriteTable(con, "test_split1", df, compress = T, overwrite = T) # default will now split compressed file into 20 equal size files.
Added information message to inform user about what files have been added to S3 location if user is overwriting an Athena table.
copy_to
method now supports compress and max_batch, to align with dbWriteTable
dbWriteTable
POSIXct
to Athena. This class was convert incorrectly and AWS Athena would return NA instead. noctua
will now correctly convert POSIXct
to timestamp but will also correct read in timestamp into POSIXct
NA
in string format. Before noctua
would return NA
in string class as ""
this has now been fixed.noctua
would translate output into a vector with current the method dbFetch
n = 0.sql_translate_env
. Previously noctua
would take the default dplyr::sql_translate_env
, now noctua
has a custom method that uses Data types from: https://docs.aws.amazon.com/athena/latest/ug/data-types.html and window functions from: https://docs.aws.amazon.com/athena/latest/ug/functions-operators-reference-section.htmlPOSIXct
class has now been added to data transfer unit testdplyr sql_translate_env
tests if R functions are correct translated in to Athena sql
syntax.dbWriteTable
is called. The bug is due to function sqlCreateTable
which dbWriteTable
calls. Parameters table
and fields
were set to NULL
. This has now been fixed.s3.location
parameter is dbWriteTable
can now be made nullableupload_data
has been rebuilt and removed the old “horrible” if statement with paste
now the function relies on sprintf
to construct the s3 location path. This method now is a lot clearer in how the s3 location is created plus it enables a dbWriteTable
to be simplified. dbWriteTable
can now upload data to the default s3_staging directory created in dbConnect
this simplifies dbWriteTable
to :data transfer
test now tests compress, and default s3.location when transferring datadbWriteTable
sqlCreateTable
info message will now only inform user if colnames have changed and display the column name that have changedconfig = list()
parameter is paws
objectsBigInt
are now passed correctly into integer64
AthenaResult
returned: Error in call[[2]] : object of type 'closure' is not subsettable
. The function do.call
was causing the issue, to address this do.call
has been removed and the helper function request
has been broken down into ResultConfiguration
to return a single component of start_query_execution
do.call
have been broken down due to error: Error in call[[2]] : object of type 'closure' is not subsettable
bigint
to integer64
in data.transfer unit testpaws
version has been set to a minimum of 0.1.5
due to latest change.data.table
is now used as the default file parser data.table::fread
/ data.table::fwrite
. This isn’t a breaking change as data.table
was used before however this change makes data.table
to default file parser.dbConnect
method can use the following methods:
arrow
packageassume_role
developed method for user to assume role when connecting to AWS Athena