readr 1.3.1

readr (development version)

readr 1.3.0

Breaking Changes

Blank line skipping

readr’s blank line skipping has been modified to be more consistent and to avoid edge cases that affected the behavior in 1.2.0. The skip parameter now behaves more like it did before readr 1.2.0, and the new skip_empty_rows parameter controls whether fully blank lines are skipped. (#923)
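A minimal sketch of the blank-line control, assuming the argument is spelled skip_empty_rows as in the read_csv() signature. The inline data is made up for illustration.

```r
library(readr)

# Hypothetical inline data: a header, one row, a fully blank line, one more row.
txt <- "x,y\n1,2\n\n3,4\n"

# Default behaviour: the fully blank line is dropped.
dropped <- read_csv(txt, skip_empty_rows = TRUE)
nrow(dropped)

# With skip_empty_rows = FALSE the blank line is documented to be kept
# as a row of missing values instead of being discarded.
kept <- read_csv(txt, skip_empty_rows = FALSE)
print(kept)
```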

tibble data frame subclass

readr 1.3.0 returns results with a spec_tbl_df subclass. This differs from a regular tibble only in that the spec attribute (which holds the column specification) is lost as soon as the object is subset (and a normal tbl_df object is returned).

Historically, tbl_df objects lost their attributes once they were subset. However, recent versions of tibble retain the attributes when subsetting, so the spec_tbl_df subclass is needed to preserve the previous behavior.

This should only break compatibility if you are explicitly checking the class of the returned object. One way to get backwards-compatible behavior is to subset the object with no arguments, e.g. x[].
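As a small illustration, assuming readr 1.3.0 or later, the class change and the x[] workaround look roughly like this (the inline data is made up):

```r
library(readr)

# Hypothetical inline data, used only to obtain a readr result.
df <- read_csv("a,b\n1,2\n3,4\n")

# The freshly read object carries the extra subclass.
class(df)

# Subsetting with no arguments returns a plain tibble without the subclass.
plain <- df[]
class(plain)
```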

Bugfixes

readr 1.2.1

This release skips the clipboard tests on CRAN servers.

readr 1.2.0

Breaking Changes

Integer column guessing

readr functions no longer guess that columns are of type integer; these columns are now guessed as numeric. Because R uses 32-bit integers and 64-bit doubles, every integer can be stored in a double with no loss of information. This change was made to remove errors when numeric columns were incorrectly guessed as integers. If you know a column contains integers and would like to read it as such, specify the column type explicitly with the col_types argument.
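A short sketch of requesting an integer column explicitly. The column names id and score are hypothetical.

```r
library(readr)

# Hypothetical inline data with whole-number columns.
txt <- "id,score\n1,10\n2,20\n"

# Guessed types: whole numbers come back as doubles.
guessed <- read_csv(txt)
typeof(guessed$id)

# Explicit type for one column; the remaining columns are still guessed.
typed <- read_csv(txt, col_types = cols(id = col_integer()))
typeof(typed$id)
```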

Blank line skipping

readr now always skips blank lines automatically when parsing, which may change the number of lines you need to pass to the skip parameter. For instance, if your file had one blank line followed by two more lines you want to skip, you would previously pass skip = 3; now you only need to pass skip = 2.

New features

Melt functions

There is now a family of melt_*() functions in readr. These functions store data in ‘long’ or ‘melted’ form, where each row corresponds to a single value in the dataset. This form is useful when your data is ragged and not rectangular.

data <- "a,b,c
1,2
w,x,y,z"

readr::melt_csv(data)
#> # A tibble: 9 x 4
#>     row   col data_type value
#>   <dbl> <dbl> <chr>     <chr>
#> 1     1     1 character a    
#> 2     1     2 character b    
#> 3     1     3 character c    
#> 4     2     1 integer   1    
#> 5     2     2 integer   2    
#> 6     3     1 character w    
#> 7     3     2 character x    
#> 8     3     3 character y    
#> 9     3     4 character z

Thanks to Duncan Garmonsway (@nacnudus) for great work on the idea and implementation of the melt_*() functions!

Connection improvements

readr 1.2.0 changes how R connections are parsed by readr. In previous versions of readr the connections were read into an in-memory raw vector, then passed to the readr functions. This made reading connections from small to medium datasets fast, but also meant that the dataset had to fit into memory at least twice (once for the raw data, once for the parsed data). It also meant that reading could not begin until the full vector was read through the connection.

Now we instead write the connection to a temporary file (in the R temporary directory), then parse that temporary file. This means connections may take a little longer to be read, but also means they will no longer need to fit into memory. It also allows the use of the chunked readers to process the data in parts.
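As a rough sketch of processing a connection in parts, the following uses read_csv_chunked() with a DataFrameCallback. The temporary file, its contents, and the chunk size of 4 are assumptions chosen for illustration.

```r
library(readr)

# Write a small hypothetical dataset to a temporary file.
path <- tempfile(fileext = ".csv")
write_csv(data.frame(x = 1:10, y = letters[1:10]), path)

# An unopened connection to that file; readr opens and closes it itself.
con <- file(path)

# Summarise each chunk: the position of its first row and its row count.
chunk_summary <- read_csv_chunked(
  con,
  DataFrameCallback$new(function(chunk, pos) {
    data.frame(start = pos, n = nrow(chunk))
  }),
  chunk_size = 4,
  col_types = "dc"
)
print(chunk_summary)
```

Only one chunk is handed to the callback at a time, so the per-chunk work stays small even when the full dataset would not fit in memory.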

Future improvements to readr could allow it to parse data from connections in a streaming fashion, which would avoid many of the drawbacks of either method.

Additional new features

Bug Fixes

readr 1.1.1

readr 1.1.0

New features

Parser improvements

Whitespace / fixed width improvements

Writing to connections

Miscellaneous features

Bugfixes

readr 1.0.0

Column guessing

The process by which readr guesses the types of columns has received a substantial overhaul to make it easier to fix problems when the initial guesses aren’t correct, and to make it easier to generate reproducible code. Now column specifications are printed by default when you read from a file:

challenge <- read_csv(readr_example("challenge.csv"))
#> Parsed with column specification:
#> cols(
#>   x = col_integer(),
#>   y = col_character()
#> )

And you can extract those values after the fact with spec():

spec(challenge)
#> cols(
#>   x = col_integer(),
#>   y = col_character()
#> )

This makes it easier to quickly identify parsing problems and fix them (#314). If the column specification is long, the new cols_condense() is used to condense the spec by identifying the most common type and setting it as the default. This is particularly useful when only a handful of columns have a different type (#466).
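To illustrate the condensing step, here is a small sketch that calls cols_condense() directly on a guessed spec. The inline data and its column names are made up.

```r
library(readr)

# Hypothetical data: three numeric columns and one character column.
txt <- "a,b,c,label\n1.5,2.5,3.5,x\n4.5,5.5,6.5,y\n"

# Full specification, one entry per column.
full <- spec_csv(txt)
print(full)

# Condensed specification: the most common type becomes the default,
# so only the columns that differ are listed explicitly.
condensed <- cols_condense(full)
print(condensed)
```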

You can also generate an initial specification without parsing the file using spec_csv(), spec_tsv(), etc.

Once you have figured out the correct column types for a file, it’s often useful to make the parsing strict. You can do this either by copying and pasting the printed output, or for very long specs, saving the spec to disk with write_rds(). In production scripts, combine this with stop_for_problems() (#465): if the input data changes form, you’ll fail fast with an error.
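The workflow above might look roughly like the sketch below. The inline data, the temporary file used to store the spec, and the variable names are all assumptions for illustration.

```r
library(readr)

# Hypothetical inputs: one in the expected form, one where column x changed.
good <- "x,y\n1,a\n2,b\n"
bad  <- "x,y\noops,a\n2,b\n"

# Guess a spec without parsing the data, then save it to disk.
spec_path <- tempfile(fileext = ".rds")
write_rds(spec_csv(good), spec_path)
saved_spec <- read_rds(spec_path)

# Input in the expected form: no problems, so this passes silently.
ok <- read_csv(good, col_types = saved_spec)
stop_for_problems(ok)

# Input that changed form: stop_for_problems() turns the parsing
# failure into an error, which is caught here only to show the outcome.
changed <- suppressWarnings(read_csv(bad, col_types = saved_spec))
failed <- inherits(
  tryCatch(stop_for_problems(changed), error = function(e) e),
  "error"
)
failed
```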

You can now also adjust the number of rows that readr uses to guess the column types with guess_max:

challenge <- read_csv(readr_example("challenge.csv"), guess_max = 1500)
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )

You can now access the guessing algorithm from R. guess_parser() will tell you which parser readr will select for a character vector (#377). We’ve made a number of fixes to the guessing algorithm:

We have made a number of improvements to the reification of the col_types, col_names and the actual data:

Column parsing

The date time parsers recognise three new format strings:

%y and %Y are now strict and require 2 or 4 characters respectively.

Date and time parsing functions received a number of small enhancements:

parse_number() is slightly more flexible - it now parses numbers up to the first ill-formed character. For example parse_number("-3-") and parse_number("...3...") now return -3 and 3 respectively. We also fixed a major bug where parsing negative numbers yielded positive values (#308).

parse_logical() now accepts 0, 1 as well as lowercase t, f, true, false.
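A quick way to try both parser changes from the console. The input strings are illustrative.

```r
library(readr)

# parse_number() reads up to the first ill-formed character and
# ignores grouping marks and surrounding non-numeric text.
n1 <- parse_number("-3-")
n2 <- parse_number("$1,234.50")

# parse_logical() accepts 0/1 and lowercase spellings.
flags <- parse_logical(c("0", "1", "t", "f", "true", "false"))

n1
n2
flags
```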

New readers and writers

Minor features and bug fixes

readr 0.2.2

readr 0.2.1

readr 0.2.0

Internationalisation

readr now has a strategy for dealing with settings that vary from place to place: locales. The default locale is still US centric (because R itself is), but you can now easily override the default timezone, decimal separator, grouping mark, day & month names, date format, and encoding. This has led to a number of changes:

See vignette("locales") for more details.
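As a brief illustration of overriding the defaults, the sketch below passes a locale() to two parsers. The specific locale settings and input strings are chosen only for the example.

```r
library(readr)

# A locale with a comma as decimal mark and a period as grouping mark.
eu <- locale(decimal_mark = ",", grouping_mark = ".")
amount <- parse_number("1.234,56", locale = eu)

# French month names for date parsing.
day <- parse_date("1 janvier 2015", format = "%d %B %Y", locale = locale("fr"))

amount
day
```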

File parsing improvements

Column parsing improvements

readr gains vignette("column-types") which describes how the defaults work and how to override them (#122).

As well as improvements to the parser, I’ve also made a number of tweaks to the heuristics that readr uses to guess column types:

Minor improvements and bug fixes