Introduction to tidygeocoder

Geocoding services provide data about locations, such as longitude and latitude coordinates. The goal of tidygeocoder is to make getting data from these services easy. The two main functions are geocode(), which takes a dataframe as input, and geo(), which takes character values as input.

The geocode() function extracts the specified address columns from the input dataframe and passes them to geo() to perform geocoding. All extra arguments (...) given to geocode() are passed to geo(), so refer to the geo() documentation for all the arguments you can give to geocode().
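As a minimal sketch of this pass-through, the call below gives method and no_query to geocode(), which forwards them to geo(). The column name addr is made up for illustration. The no_query = TRUE argument (covered under Advanced Usage) keeps the example from contacting any service, so the coordinates should come back as NA.

```r
library(tibble)
library(tidygeocoder)

# 'addr' is a hypothetical column name chosen for this example
addrs <- tibble(addr = c('11 Wall St, NY, NY', 'Salzburg, Austria'))

# address names the column to geocode; method and no_query are
# forwarded to geo() through ...
passthrough_result <- geocode(addrs, address = addr, method = 'osm',
                              no_query = TRUE)
passthrough_result
```

Because no query is sent, this is also a cheap way to check that the column and argument names are accepted before running a real query.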

Basic Queries

library(tibble)
library(DT)
library(dplyr)
library(tidygeocoder)

address_single <- tibble(singlelineaddress = c('11 Wall St, NY, NY', 
                    '600 Peachtree Street NE, Atlanta, Georgia'))
address_components <- tribble(
  ~street                      , ~cty,               ~st,
  '11 Wall St',                  'NY',               'NY',
  '600 Peachtree Street NE',     'Atlanta',          'GA'
)

You can use the address argument to specify single-line addresses. Note that when multiple addresses are provided, the batch geocoding functionality of the Census geocoder service is used. Additionally, verbose = TRUE displays logs to the console.

address_single %>% geocode(address = singlelineaddress, method = 'census',
                           verbose = TRUE)
#> Number of Unique Addresses: 2
#> Passing 2 addresses to the census batch geocoder
#> Querying API URL: https://geocoding.geo.census.gov/geocoder/locations/addressbatch
#> Passing the following parameters to the API:
#> format : "json"
#> benchmark : "Public_AR_Current"
#> vintage : "Current_Current"
#> 
#> Query completed in: 1.5 seconds
#> # A tibble: 2 x 3
#>   singlelineaddress                           lat  long
#>   <chr>                                     <dbl> <dbl>
#> 1 11 Wall St, NY, NY                         40.7 -74.0
#> 2 600 Peachtree Street NE, Atlanta, Georgia  33.8 -84.4

Alternatively, you can run the same query with the geo() function by passing the address values from the dataframe directly. In both geo() and geocode(), the lat and long arguments name the resulting latitude and longitude fields. Here the method argument is used to specify the OSM (Nominatim) geocoder service.

geo(address = address_single$singlelineaddress, method = 'osm', 
    lat = latitude, long = longitude)
#> # A tibble: 2 x 3
#>   address                                   latitude longitude
#>   <chr>                                        <dbl>     <dbl>
#> 1 11 Wall St, NY, NY                            40.7     -74.0
#> 2 600 Peachtree Street NE, Atlanta, Georgia     33.8     -84.4

Instead of single-line addresses, you can use any combination of the following arguments to specify your addresses: street, city, state, county, postalcode, and country.

address_components %>% geocode(street = street, city = cty, state = st,
                               method = 'census')
#> # A tibble: 2 x 5
#>   street                  cty     st      lat  long
#>   <chr>                   <chr>   <chr> <dbl> <dbl>
#> 1 11 Wall St              NY      NY     40.7 -74.0
#> 2 600 Peachtree Street NE Atlanta GA     33.8 -84.4

The cascade method first tries one geocoder service and then uses a second geocoder service to retry the addresses that were not found. By default it uses the Census geocoder first and then OSM, but you can specify any two methods you want (in order) with the cascade_order argument.

addr_comp1 <- address_components %>% 
  bind_rows(tibble(cty = c('Toronto', 'Tokyo'), country = c('Canada', 'Japan')))

addr_comp1 %>% geocode(street = street, state = st, city = cty,
                       country = country, method = 'cascade')
#> # A tibble: 4 x 7
#>   street                  cty     st    country   lat  long geo_method
#>   <chr>                   <chr>   <chr> <chr>   <dbl> <dbl> <chr>     
#> 1 11 Wall St              NY      NY    <NA>     40.7 -74.0 census    
#> 2 600 Peachtree Street NE Atlanta GA    <NA>     33.8 -84.4 census    
#> 3 <NA>                    Toronto <NA>  Canada   43.6 -79.4 osm       
#> 4 <NA>                    Tokyo   <NA>  Japan    35.7 139.  osm
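The cascade logic can be pictured with a small base R sketch. This is not the package's implementation: first_service(), second_service(), and cascade_sketch() are hypothetical stand-ins, and the latitudes are rounded placeholder values. The point is only that the second service sees just the addresses the first one missed.

```r
# Hypothetical stand-ins for two geocoder services; each returns NA for
# an address it cannot find. Latitudes are rounded placeholder values.
first_service <- function(address) {
  known <- c('11 Wall St, NY, NY' = 40.7)
  unname(known[address])
}
second_service <- function(address) {
  known <- c('11 Wall St, NY, NY' = 40.7, 'Tokyo, Japan' = 35.7)
  unname(known[address])
}

cascade_sketch <- function(addresses) {
  # First pass: try every address with the first service
  lat <- first_service(addresses)
  method <- ifelse(is.na(lat), NA_character_, 'first')

  # Second pass: retry only the addresses that were not found
  missing <- is.na(lat)
  lat[missing] <- second_service(addresses[missing])
  method[missing & !is.na(lat)] <- 'second'

  data.frame(address = addresses, lat = lat, geo_method = method,
             stringsAsFactors = FALSE)
}

cascade_result <- cascade_sketch(
  c('11 Wall St, NY, NY', 'Tokyo, Japan', 'Nowhere')
)
cascade_result
```

An address that neither stand-in knows keeps NA in both the lat and geo_method columns.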

Beyond Latitude and Longitude

To return more data than just the latitude and longitude coordinates, specify full_results = TRUE. Additionally, for the Census geocoder you can get fields for geographies such as Census tracts by specifying return_type = 'geographies'. Be sure to use full_results = TRUE with return_type = 'geographies' in order to allow the Census geography columns to be returned.

census_full1 <- address_single %>% geocode(address = singlelineaddress, 
      method = 'census', full_results = TRUE, return_type = 'geographies')
glimpse(census_full1)
#> Rows: 2
#> Columns: 14
#> $ singlelineaddress <chr> "11 Wall St, NY, NY", "600 Peachtree Street NE, Atl…
#> $ lat               <dbl> 40.70747, 33.77085
#> $ long              <dbl> -74.01122, -84.38505
#> $ id                <int> 1, 2
#> $ input_address     <chr> "11 Wall St, NY, NY, , , ", "600 Peachtree Street N…
#> $ match_indicator   <chr> "Match", "Match"
#> $ match_type        <chr> "Exact", "Non_Exact"
#> $ matched_address   <chr> "11 WALL ST, NEW YORK, NY, 10005", "600 PEACHTREE S…
#> $ tiger_line_id     <int> 59659656, 17343689
#> $ tiger_side        <chr> "R", "L"
#> $ state_fips        <int> 36, 13
#> $ county_fips       <int> 61, 121
#> $ census_tract      <int> 700, 1900
#> $ census_block      <int> 1008, 2003

As mentioned earlier, the geocode() function passes addresses in dataframes to the geo() function for geocoding, so we can also use the geo() function directly in a similar way:

salz <- geo('Salzburg, Austria', method = 'osm', full_results = TRUE)
glimpse(salz)
#> Rows: 1
#> Columns: 13
#> $ address      <chr> "Salzburg, Austria"
#> $ lat          <dbl> 47.79813
#> $ long         <dbl> 13.04648
#> $ place_id     <int> 206608
#> $ licence      <chr> "Data © OpenStreetMap contributors, ODbL 1.0. https://os…
#> $ osm_type     <chr> "node"
#> $ osm_id       <int> 34964314
#> $ boundingbox  <list> [<"47.6381346", "47.9581346", "12.8864806", "13.2064806…
#> $ display_name <chr> "Salzburg, 5020, Österreich"
#> $ class        <chr> "place"
#> $ type         <chr> "city"
#> $ importance   <dbl> 0.6854709
#> $ icon         <chr> "https://nominatim.openstreetmap.org/images/mapicons/poi…

Working With Messy Data

Only unique addresses are passed to geocoder services even if your data contains duplicates. Missing/NA and blank addresses are excluded from queries.

duplicate_addrs <- address_single %>%
  bind_rows(address_single) %>%
  bind_rows(tibble(singlelineaddress = rep(NA, 3)))

duplicates_geocoded <- duplicate_addrs %>%
  geocode(singlelineaddress, verbose = TRUE)
#> Number of Unique Addresses: 2
#> Passing 2 addresses to the census batch geocoder
#> Querying API URL: https://geocoding.geo.census.gov/geocoder/locations/addressbatch
#> Passing the following parameters to the API:
#> format : "json"
#> benchmark : "Public_AR_Current"
#> vintage : "Current_Current"
#> 
#> Query completed in: 1.4 seconds

knitr::kable(duplicates_geocoded)
|singlelineaddress                         |      lat|      long|
|:-----------------------------------------|--------:|---------:|
|11 Wall St, NY, NY                        | 40.70747| -74.01122|
|600 Peachtree Street NE, Atlanta, Georgia | 33.77085| -84.38505|
|11 Wall St, NY, NY                        | 40.70747| -74.01122|
|600 Peachtree Street NE, Atlanta, Georgia | 33.77085| -84.38505|
|NA                                        |       NA|        NA|
|NA                                        |       NA|        NA|
|NA                                        |       NA|        NA|

As shown above, duplicates are not removed from your results by default. However, you can return only unique results by using unique_only = TRUE. Note that passing unique_only = TRUE to geocode() causes the original dataframe format (including column names) to be discarded in favor of the standard field names (i.e. “address”, “city”, “state”, etc.).

duplicate_addrs %>%
  geocode(singlelineaddress, unique_only = TRUE)
#> # A tibble: 2 x 3
#>   address                                     lat  long
#>   <chr>                                     <dbl> <dbl>
#> 1 11 Wall St, NY, NY                         40.7 -74.0
#> 2 600 Peachtree Street NE, Atlanta, Georgia  33.8 -84.4
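The handling of duplicate and missing addresses described in this section can be sketched in base R. This is an illustration of the idea, not the package's code: lookup_stub() is a hypothetical stand-in for a geocoder service, and the values it returns are placeholders rather than real coordinates.

```r
# Hypothetical stand-in for a geocoder service call; it records which
# addresses it was asked to look up
queried <- character(0)
lookup_stub <- function(addresses) {
  queried <<- addresses
  seq_along(addresses) * 10  # placeholder values, not real latitudes
}

addresses <- c('A St', 'B Ave', 'A St', NA, '', 'B Ave')

# Keep unique addresses, dropping NA and blank values
valid <- !is.na(addresses) & trimws(addresses) != ''
unique_addrs <- unique(addresses[valid])

# Query once per unique address, then map results back to every input row
unique_lat <- lookup_stub(unique_addrs)
lat <- unique_lat[match(addresses, unique_addrs)]

dedup_result <- data.frame(address = addresses, lat = lat,
                           stringsAsFactors = FALSE)
dedup_result
```

Only two addresses reach the stand-in service, yet all six input rows appear in the result, with NA for the missing and blank entries.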

Advanced Usage

The limit argument can be specified to return multiple matches per address if available:

geo_limit <- geo(c('Lima, Peru', 'Cairo, Egypt'), method = 'osm', 
    limit = 3, full_results = TRUE)
glimpse(geo_limit)
#> Rows: 4
#> Columns: 13
#> $ address      <chr> "Lima, Peru", "Lima, Peru", "Lima, Peru", "Cairo, Egypt"
#> $ lat          <dbl> -12.06211, -12.20011, -11.99997, 30.04882
#> $ long         <dbl> -77.03653, -76.28506, -76.83322, 31.24367
#> $ place_id     <int> 286976132, 235673177, 235480647, 236205997
#> $ licence      <chr> "Data © OpenStreetMap contributors, ODbL 1.0. https://os…
#> $ osm_type     <chr> "relation", "relation", "relation", "relation"
#> $ osm_id       <int> 1944756, 1944659, 1944670, 5466227
#> $ boundingbox  <list> [<"-12.0797663", "-12.0303496", "-77.0884555", "-77.001…
#> $ display_name <chr> "Lima, Peru", "Lima, Peru", "Lima, Peru", "القاهرة, Egyp…
#> $ class        <chr> "boundary", "boundary", "boundary", "place"
#> $ type         <chr> "administrative", "administrative", "administrative", "c…
#> $ importance   <dbl> 0.8930015, 0.7219761, 0.7034835, 0.7960286
#> $ icon         <chr> "https://nominatim.openstreetmap.org/images/mapicons/poi…

To pass API parameters that are specific to a given method, use the custom_query argument. For example, the Nominatim (OSM) geocoder has a ‘polygon_geojson’ parameter that returns GeoJSON geometry content. To pass this parameter, supply it in a named list via custom_query:

cairo_geo <- geo('Cairo, Egypt', method = 'osm', full_results = TRUE,
    custom_query = list(polygon_geojson = 1), verbose = TRUE)
#> Number of Unique Addresses: 1
#> Querying API URL: http://nominatim.openstreetmap.org/search
#> Passing the following parameters to the API:
#> limit : "1"
#> q : "Cairo, Egypt"
#> polygon_geojson : "1"
#> format : "json"
#> 
#> Query completed in: 0.3 seconds
#> Total query time (including sleep): 1 seconds
#> 
glimpse(cairo_geo)
#> Rows: 1
#> Columns: 15
#> $ address             <chr> "Cairo, Egypt"
#> $ lat                 <dbl> 30.04882
#> $ long                <dbl> 31.24367
#> $ place_id            <int> 236205997
#> $ licence             <chr> "Data © OpenStreetMap contributors, ODbL 1.0. htt…
#> $ osm_type            <chr> "relation"
#> $ osm_id              <int> 5466227
#> $ boundingbox         <list> [<"29.7483062", "30.3209168", "31.2200331", "31.…
#> $ display_name        <chr> "القاهرة, Egypt / مصر"
#> $ class               <chr> "place"
#> $ type                <chr> "city"
#> $ importance          <dbl> 0.7960286
#> $ icon                <chr> "https://nominatim.openstreetmap.org/images/mapic…
#> $ geojson.type        <chr> "Polygon"
#> $ geojson.coordinates <list> [<array[1 x 119 x 2]>]
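One way to picture what custom_query does is as a named list merged into the parameters that are sent to the API. The sketch below is an assumption about the general idea, not the package's internal code. The default_params list is hypothetical and simply mirrors the parameters shown in the log output above.

```r
# Hypothetical default parameters, modelled on the verbose log output
default_params <- list(limit = 1, q = 'Cairo, Egypt', format = 'json')

# Extra API parameters supplied by the user as a named list
custom_query <- list(polygon_geojson = 1)

# Combine the two lists; entries in custom_query are appended (or would
# replace a default that has the same name)
query_params <- utils::modifyList(default_params, custom_query)
names(query_params)
```

The names must match what the geocoder service itself expects, since they are API parameter names rather than tidygeocoder argument names.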

To test a query without sending any data to a geocoder service, you can use no_query = TRUE (NA results are returned).

geo(c('Vancouver, Canada', 'Las Vegas, NV'), no_query = TRUE, 
    method = 'osm')
#> Number of Unique Addresses: 2
#> Executing single address geocoding...
#> 
#> Number of Unique Addresses: 1
#> Querying API URL: http://nominatim.openstreetmap.org/search
#> Passing the following parameters to the API:
#> limit : "1"
#> q : "Vancouver, Canada"
#> format : "json"
#> 
#> Number of Unique Addresses: 1
#> Querying API URL: http://nominatim.openstreetmap.org/search
#> Passing the following parameters to the API:
#> limit : "1"
#> q : "Las Vegas, NV"
#> format : "json"
#> 
#> # A tibble: 2 x 3
#>   address           lat   long 
#>   <chr>             <lgl> <lgl>
#> 1 Vancouver, Canada NA    NA   
#> 2 Las Vegas, NV     NA    NA

Here are some additional usage notes for the geocode() and geo() functions:

API Reference

You can refer to the api_parameter_reference dataset to see which parameters are supported by each geocoder service. This dataset is displayed below.

Refer to ?api_parameter_reference for more details and links to the API documentation for each geocoder service.
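As a small sketch, the dataset can also be filtered in the console to show the parameters for a single service. This assumes the dataset has a method column naming each geocoder service; check ?api_parameter_reference for the actual column names in your installed version.

```r
library(tidygeocoder)

# Rows describing the parameters of the OSM (Nominatim) service;
# assumes a 'method' column that names each geocoder service
osm_params <- subset(api_parameter_reference, method == 'osm')
osm_params
```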