* Added support for stopping AWS Athena when R has been interrupted. A new parameter, `keyboard_interrupt`, has been added to `dbConnect`. Example:

```r
# Stop AWS Athena when R has been interrupted:
con <- dbConnect(RAthena::athena())
# Let AWS Athena keep running when R has been interrupted:
con <- dbConnect(RAthena::athena(),
                 keyboard_interrupt = FALSE)
```

* `RAthena` would return a `data.frame` for utility SQL queries regardless of the backend file parser. This was due to AWS Athena outputting SQL UTILITY queries as a text file that has to be read in line by line. Now `RAthena` returns the correct data format based on the file parser set in `RAthena_options`; for example, `RAthena_options("vroom")` will return tibbles.
* Fixed `dbClearResult` when the user doesn't have permission to delete AWS S3 objects ([noctua #96](https://github.com/DyfanJones/noctua/issues/96)).
* `RAthena_options` contains 2 new parameters to control how `RAthena` handles retries (see the sketch further below).
* `dbFetch` is able to return data from AWS Athena in chunks. This has been achieved by passing `NextToken` to the `AthenaResult` S4 class. This method won't be as fast as `n = -1`, as each chunk has to be processed into data frame format.

```r
library(DBI)
con <- dbConnect(RAthena::athena())
res <- dbExecute(con, "select * from some_big_table limit 10000")
dbFetch(res, 5000)
```

* `dbWriteTable` now opts to use `ALTER TABLE` instead of the standard `MSCK REPAIR TABLE`. This improves performance when appending to tables with a high number of existing partitions.
* `dbWriteTable` now allows JSON to be appended to JSON DDLs created with the OpenX JSON SerDe library.
* `dbConvertTable` brings `dplyr::compute` functionality to the base package, allowing `RAthena` to use the power of AWS Athena to convert tables and queries to more efficient file formats in AWS S3 (#37).
* `dplyr::compute` updated to give the same functionality as `dbConvertTable`.
* The error message raised when `boto3` is not detected has been updated, as several users were not sure how to get `RAthena` set up:

```r
stop("Boto3 is not detected please install boto3 using either: `pip install boto3 numpy` in terminal or `install_boto()`.",
"\nIf this doesn't work please set the python you are using with `reticulate::use_python()` or `reticulate::use_condaenv()`",
     call. = FALSE)
```
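A minimal sketch of tuning the retry behaviour mentioned above. The parameter names `retry` and `retry_quiet` are assumptions about the `RAthena_options` interface, not confirmed by this changelog:

```r
library(RAthena)

# Assumed parameters: retry = maximum number of retries,
# retry_quiet = whether retry messages are suppressed.
RAthena_options(retry = 10, retry_quiet = FALSE)
```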
* Added a `region_name` check before making a connection to AWS Athena (#110).
* `dbWriteTable` would throw a throttling error every now and again; `retry_api_call` has been built to handle the parsing of data between R and AWS S3.
* Fixed `dbWriteTable` not clearing down all metadata when uploading to AWS Athena.
* `dbWriteTable` added support for DDL structures for users who have created DDLs outside of `RAthena`.
* Documented the `RAthena` retry functionality.
* Examples are now wrapped in `\dontrun` (#108).
* Inspired by `pyathena`, `RAthena_options` now has a new parameter, `cache_size`. This implements local caching in R environments instead of using AWS `list_query_executions`. This is down to `dbClearResult` clearing S3's Athena output when caching isn't disabled.
* `RAthena_options` now has a `clear_cache` parameter to clear down all cached data.
* `dbRemoveTable` now utilises AWS Glue to remove tables from the AWS Glue catalogue. This brings a performance enhancement:

```r
library(DBI)
con <- dbConnect(RAthena::athena())
# upload iris dataframe for removal test
dbWriteTable(con, "iris2", iris)
# Athena method
system.time(dbRemoveTable(con, "iris2", confirm = TRUE))
# user system elapsed
# 0.131 0.037 2.404
# upload iris dataframe for removal test
dbWriteTable(con, "iris2", iris)
# Glue method
system.time(dbRemoveTable(con, "iris2", confirm = TRUE))
# user system elapsed
# 0.065 0.009 1.303
```

* `dbWriteTable` now supports uploading JSON lines (http://jsonlines.org/) format to AWS Athena (#88):

```r
library(DBI)
con <- dbConnect(RAthena::athena())
dbWriteTable(con, "iris2", iris, file.type = "json")
dbGetQuery(con, "select * from iris2")
```

* Fixed `dbWriteTable` returning the wrong compression file type when appending to an existing table.
* `install_boto` now adds `numpy` to the `RAthena` environment install, as `reticulate` appears to favour environments with `numpy` (https://github.com/rstudio/reticulate/issues/216).
* Fixed an issue with the RStudio connection tab when a Glue table isn't stored correctly (#92).
* Added `AWS_REGION` to `dbConnect`.
* `fwrite` (>= 1.12.4): https://github.com/Rdatatable/data.table/blob/master/NEWS.md
* `sql_translate_env` (#44):

```r
# Before
dbplyr::translate_sql("2019-01-01", con = con)
# '2019-01-01'
# Now
dbplyr::translate_sql("2019-01-01", con = con)
# DATE '2019-01-01'
```

* `paste`/`paste0` would use the default `dplyr` sql-translate-env (`concat_ws`). `paste0` now uses Presto's `concat` function, and `paste` now uses pipes (`||`) to get extra flexibility for custom separating values (see the sketch after this list).
* When `append` is set to `TRUE`, the existing `s3.location` will be utilised (#73).
* `db_compute` returned the table name, but when a user wished to write the table to another location (#74), an error would be raised: `Error: SYNTAX_ERROR: line 2:6: Table awsdatacatalog.default.temp.iris does not exist`. This has now been fixed by `db_compute` returning `dbplyr::in_schema`:

```r
library(DBI)
library(dplyr)
con <- dbConnect(RAthena::athena())
tbl(con, "iris") %>%
  compute(name = "temp.iris")
```

* `dbListFields` didn't display partitioned columns. This has now been fixed: the call to AWS Glue has been altered to include more metadata, allowing column names and partitions to be returned.
* Unit tests: `dbListFields`, `RAthena_options`.
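A brief illustration of the new `paste`/`paste0` translations, reusing the `translate_sql` pattern shown above; the rendered SQL in the comments is indicative, not exact output:

```r
library(DBI)

con <- dbConnect(RAthena::athena())

# paste0 should now translate to Presto's concat():
dbplyr::translate_sql(paste0(x, "y"), con = con)
# e.g. CONCAT("x", 'y')

# paste should now join its arguments with the || operator and the separator:
dbplyr::translate_sql(paste(x, "y", sep = " "), con = con)
# e.g. "x" || ' ' || 'y'
```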
* `vroom` has been restricted to >= 1.2.0 due to `integer64` support and changes to the `vroom` API.
* `dbStatistics` is a wrapper around `boto3`'s `get_query_execution`, returning statistics for `RAthena::dbSendQuery` results (#67).
* `dbGetQuery` has a new parameter, `statistics`, to print out `dbStatistics` before returning Athena results (#67).
* `s3.location` now follows the new syntax `s3://bucket/{schema}/{table}/{partition}/{table_file}`, to align with `pyathena` and to allow tables with the same name but in different schemas to be uploaded to S3 (#73).
* Speed improvement in `dplyr::tbl` when calling Athena using the `ident` method (noctua #64):

```r
library(DBI)
library(dplyr)
con <- dbConnect(RAthena::athena())
# ident method:
t1 <- system.time(tbl(con, "iris"))
# sub query method:
t2 <- system.time(tbl(con, sql("select * from iris")))
# ident method
# user system elapsed
# 0.082 0.012 0.288
# sub query method
# user system elapsed
# 0.993 0.138 3.660
```

* `dplyr` `sql_translate_env`: expected results have now been updated to take into account the bug fix with date fields.
* Added `vroom` as an alternative file parser to `data.table`. It is now possible to change the file parser using `RAthena_options`; see the sketch after this list.
* Added `dbGetTables`, which returns the Athena hierarchy as a `data.frame`.
* Unit tests: `vroom`.
* Updated R documentation to `roxygen2` 7.0.2.
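A minimal sketch of switching the file parser, mirroring the `RAthena_options("vroom")` call mentioned earlier (assumes the `vroom` package is installed):

```r
library(RAthena)

# Switch the backend file parser from data.table (the default) to vroom;
# query results will then be returned as tibbles.
RAthena_options("vroom")
```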
* `dbWriteTable`'s `append` parameter checks and uses the existing AWS Athena DDL file type. If `file.type` doesn't match the Athena DDL file type, the user will receive a warning message:

```r
warning('Appended `file.type` is not compatible with the existing Athena DDL file type and has been converted to "', File.Type, '".',
        call. = FALSE)
```

* Removed the `tolower` conversion due to request (#41).
* `dbRemoveTable` can now remove the S3 files for the AWS Athena table being removed.
* Fixed `as.character` getting wrongly translated (#45).
* Fixed `INTEGER` being incorrectly translated in `sql_translate_env.R`.
* Unit tests: data transfer; new `dbRemoveTable` parameters; `sql_translate_env` unit test updated to cater for the bug fix.
* `dbWriteTable` will now split gzip-compressed files to improve AWS Athena performance. By default, gzip-compressed files will be split into 20.

Performance results:

```r
library(DBI)
X <- 1e8
df <- data.frame(w = runif(X),
                 x = 1:X,
                 y = sample(letters, X, replace = TRUE),
                 z = sample(c(TRUE, FALSE), X, replace = TRUE))
con <- dbConnect(RAthena::athena())
# upload dataframe with different splits
dbWriteTable(con, "test_split1", df, compress = TRUE, max.batch = nrow(df), overwrite = TRUE) # no splits
dbWriteTable(con, "test_split2", df, compress = TRUE, max.batch = 0.05 * nrow(df), overwrite = TRUE) # 20 splits
dbWriteTable(con, "test_split3", df, compress = TRUE, max.batch = 0.1 * nrow(df), overwrite = TRUE) # 10 splits
```

AWS Athena performance results from the AWS console (query executed: `select count(*) from ....`):

```r
library(DBI)
X <- 1e8
df <- data.frame(w = runif(X),
                 x = 1:X,
                 y = sample(letters, X, replace = TRUE),
                 z = sample(c(TRUE, FALSE), X, replace = TRUE))
con <- dbConnect(RAthena::athena())
# default will now split the compressed file into 20 equal-sized files
dbWriteTable(con, "test_split1", df, compress = TRUE, overwrite = TRUE)
```

* Added an information message to inform the user about what files have been added to the S3 location when overwriting an Athena table.
* The `copy_to` method now supports `compress` and `max_batch`, to align with `dbWriteTable`.
* `dbWriteTable`: fixed transferring `POSIXct` to Athena. This class was converted incorrectly and AWS Athena would return `NA` instead. `RAthena` will now correctly convert `POSIXct` to `timestamp`, and will also correctly read `timestamp` into `POSIXct`.
* Fixed `NA` in string format. Before, `RAthena` would return `NA` in string class as `""`; this has now been fixed.
* Fixed a bug where `RAthena` would translate output into a vector with the current method `dbFetch` `n = 0`.
* Custom `sql_translate_env`. Previously `RAthena` would take the default `dplyr::sql_translate_env`; now `RAthena` has a custom method that uses data types from https://docs.aws.amazon.com/athena/latest/ug/data-types.html and window functions from https://docs.aws.amazon.com/athena/latest/ug/functions-operators-reference-section.html.
* The `POSIXct` class has now been added to the data transfer unit test.
* `dplyr` `sql_translate_env` unit test checks that R functions are correctly translated into Athena SQL syntax.
* Fixed a bug raised when `dbWriteTable` is called. The bug was due to the function `sqlCreateTable`, which `dbWriteTable` calls; its parameters `table` and `fields` were set to `NULL`. This has now been fixed.
* The `s3.location` parameter in `dbWriteTable` can now be made nullable.
* The `sqlCreateTable` info message will now only inform the user if column names have changed, and will display the column names that changed.
* `upload_data` has been rebuilt: the old "horrible" `if` statement with `paste` has been removed, and the function now relies on `sprintf` to construct the S3 location path. This method is a lot clearer in how the S3 location is created, and it enables `dbWriteTable` to be simplified: data can now be uploaded to the default `s3_staging` directory created in `dbConnect` (see the sketch after this list).
* The data transfer test for `dbWriteTable` now tests `compress` and the default `s3.location` when transferring data.
* Data is now read in with `data.table::fread`. This enables data types to be read in correctly, without requiring a second stage to convert data types once the data has been read into R.
* `data.table::fread` and `data.table::fwrite` have been disabled.
* Removed `utils` functions from the namespace: `write.table`, `read.csv`.
* Added `data.table` to the namespace.
* Athena `bigint` is now converted into R's `bit64::integer64`, and vice versa.
* Added `bigint` to `integer64` in the data transfer unit test.
* `dbConnect` method.
* `dbFetch` with chunk sizes between 0 and 999: fixed an error where the for loop would return an error instead of breaking.
* `py_error` function: set the `call.` parameter to `FALSE`.
* The `AthenaQuery` S4 class has been changed to `AthenaResult`.
* `dbFetch`: added data type collection.
* `dbFetch`: replaced the S3 search for the query key with the output location from Athena.
* `dbClearResult`: changed the error to return the Python error as a warning, warning that the user doesn't have permission to delete the S3 resource.
* `dbClearResult`: replaced the S3 search for the query key with the output location from Athena.
* `dbListTables` now returns a vector of tables from AWS Glue instead of using an AWS Athena query. This increases the speed of the call.
* `dbListFields` now returns column names from AWS Glue instead of using an AWS Athena query. This increases the speed of the call.
* `dbExistsTable` now returns a boolean from AWS Glue instead of using an AWS Athena query. This increases the speed of the call.
* `create_work_group`: creates a workgroup with the specified name.
* `delete_work_group`: deletes the workgroup with the specified name.
* `list_work_group`: lists available workgroups for the account.
* `get_work_group`: returns information about the workgroup with the specified name.
* `update_work_group`: updates the workgroup with the specified name. The workgroup's name cannot be changed.
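A minimal sketch of the simplified `dbWriteTable` call described above: with `s3.location` nullable, data is uploaded to the default `s3_staging` directory created in `dbConnect`:

```r
library(DBI)

con <- dbConnect(RAthena::athena())

# No s3.location required: the default s3_staging directory
# from dbConnect is used.
dbWriteTable(con, "iris", iris)
```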
* `get_session_token`: creates temporary session credentials.
* `assume_role`: assumes an AWS ARN role.
* `dbConnect`.
* `set_aws_env`: sets AWS tokens as environment variables.
* `get_aws_env`: returns expected results from system variables.
* `tag_options`: creates tag options for `create_work_group`.
* `work_group_config` and `work_group_config_update`: create the workgroup config.
* `AthenaConnection`.
* `dbColumnInfo` method: returns a `data.frame` containing `field_name` and `type` (see the sketch after this list).
* `time_check`: checks how long is left on the Athena connection; if less than 15 minutes remain, a warning message is output to notify the user.
* `db_collect`: for better integration with `dplyr`.
* `db_save_query`: for better integration with `dplyr`.
* `db_copy_to`: for better integration with `dplyr`.
* Fixed `dbFetch` Athena data type misalignment.
* `AthenaConnection`: `request`, to build the Athena query request.
* `db_desc`.
* `dbConnect`.
* Added `stop_query_execution` to `dbClearResult` if the query is still running.
* `dbWriteTable`.
* Changed `waiter` to `poll`, to align with Python's polling.
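A short sketch of the `dbColumnInfo` method listed above; it is assumed here to be called on a result object in the usual DBI fashion:

```r
library(DBI)

con <- dbConnect(RAthena::athena())

res <- dbSendQuery(con, "select * from iris")
dbColumnInfo(res) # data.frame containing field_name and type
dbClearResult(res)
```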