Each successful call of the nass_data()
command will return a data.frame object with 39 variables, although a handful of these variables will have the same value for each observation in the data.frame due to the nature of setting parameters for a query. The resulting data.frame is of the long variety with the Value
variable as the numerical variable of interest for the query.
The official documentation for each of these variables from the USDA are as follows:
week_ending
- Week ending date, used when freq_desc = WEEKLY.state_name
- State full name.country_code
- US Census Bureau, Foreign Trade Division 4-digit country code, as of April, 2007.location_desc
- Full description for the location dimension.begin_code
- If applicable, a 2-digit code corresponding to the beginning of the reference period (e.g., for freq_desc = MONTHLY, begin_code ranges from 01 (January) to 12 (December)).zip_5
- US Postal Service 5-digit zip code.county_ansi
- ANSI standard 3-digit county codes.state_alpha
- State abbreviation, 2-character alpha code.util_practice_desc
- Utilizations (e.g., GRAIN, FROZEN, SLAUGHTER) or marketing channels (e.g., FRESH MARKET, PROCESSING, RETAIL).domain_desc
- Generally another characteristic of operations that produce a particular commodity (e.g., ECONOMIC CLASS, AREA OPERATED, NAICS CLASSIFICATION, SALES). For chemical usage data, the domain describes the type of chemical applied to the commodity. The domain = TOTAL will have no further breakouts; i.e., the data value pertains completely to the short_desc.asd_desc
- Ag statistics district name.freq_desc
- Length of time covered (ANNUAL, SEASON, MONTHLY, WEEKLY, POINT IN TIME). MONTHLY often covers more than one month. POINT IN TIME is as of a particular day.prodn_practice_desc
- A method of production or action taken on the commodity (e.g., IRRIGATED, ORGANIC, ON FEED).end_code
- If applicable, a 2-digit code corresponding to the end of the reference period (e.g., the reference period of JAN THRU MAR will have begin_code = 01 and end_code = 03).sector_desc
- Five high level, broad categories useful to narrow down choices (CROPS, ANIMALS & PRODUCTS, ECONOMICS, DEMOGRAPHICS, and ENVIRONMENTAL).short_desc
- A concatenation of six columns: commodity_desc, class_desc, prodn_practice_desc, util_practice_desc, statisticcat_desc, and unit_desc.country_name
- County name.Value
- Published data value or suppression reason code.reference_period_desc
- The specific time frame, within a freq_desc.CV (%)
- Coefficient of variation. Available for the 2012 Census of Agriculture only. County-level CVs are generalized.class_desc
- Generally a physical attribute (e.g., variety, size, color, gender) of the commodity.asd_code
- NASS defined county groups, unique within a state, 2-digit ag statistics district code.agg_level_desc
- Aggregation level or geographic granularity of the data (e.g., STATE, AG DISTRICT, COUNTY, REGION, ZIP CODE).county_name
- Country name.region_desc
- NASS defined geographic entities not readily defined by other standard geographic levels. A region can be a less than a state (SUB-STATE) or a group of states (MULTI-STATE), and may be specific to a commodity.watershed_desc
- Name assigned to the HUC.state_ansi
- American National Standards Institute (ANSI) standard 2-digit state codes.congr_district_code
- US Congressional District 2-digit code.domaincat_desc
- Categories or partitions within a domain (e.g., under domain = SALES, domain categories include $1,000 TO $9,999, $10,000 TO $19,999, etc).state_fips_code
- NASS 2-digit state codes; include 99 and 98 for US TOTAL and OTHER STATES, respectively; otherwise match ANSI codes.group_desc
- Subsets within sector (e.g., under sector = CROPS, the groups are FIELD CROPS, FRUIT & TREE NUTS, HORTICULTURE, and VEGETABLES).watershed_code
- US Geological Survey (USGS) 8-digit Hydrologic Unit Code (HUC) for watersheds.unit_desc
- The unit associated with the statistic category (e.g., ACRES, $ / LB, HEAD, $, OPERATIONS).source_desc
- Source of data (CENSUS or SURVEY). Census program includes the Census of Ag as well as follow up projects. Survey program includes national, state, and county surveys.load_time
- Date and time indicating when record was inserted into Quick Stats database.county_code
- NASS 3-digit county codes; includes 998 for OTHER (COMBINED) COUNTIES and Alaska county codes; otherwise match ANSI codes.statisticcat_desc
- The aspect of a commodity being measured (e.g., AREA HARVESTED, PRICE RECEIVED, INVENTORY, SALES).commodity_desc
- The primary subject of interest (e.g., CORN, CATTLE, LABOR, TRACTORS, OPERATORS).year
- The numeric year of the data.I learn best through examples, so I’ll cover a few different levels of analysis and subtleties related to the data.
We can set a query where we return all data at the county level in Ohio related to rent, which is equivalent to https://quickstats.nass.usda.gov/ setting “Geographic Level” to COUNTY, “State” equal to Ohio, and “Commodity” equal to RENT :
library("usdarnass")
library("dplyr") # Helpful package
ohio_rent <- nass_data(commodity_desc = "RENT", agg_level_desc = "COUNTY",
state_name = "OHIO")
glimpse(ohio_rent)
#> Observations: 2,006
#> Variables: 39
#> $ source_desc <chr> "CENSUS", "CENSUS", "CENSUS", "CENSUS", "C…
#> $ sector_desc <chr> "ECONOMICS", "ECONOMICS", "ECONOMICS", "EC…
#> $ group_desc <chr> "EXPENSES", "EXPENSES", "EXPENSES", "EXPEN…
#> $ commodity_desc <chr> "RENT", "RENT", "RENT", "RENT", "RENT", "R…
#> $ class_desc <chr> "CASH, LAND & BUILDINGS", "CASH, LAND & BU…
#> $ prodn_practice_desc <chr> "ALL PRODUCTION PRACTICES", "ALL PRODUCTIO…
#> $ util_practice_desc <chr> "ALL UTILIZATION PRACTICES", "ALL UTILIZAT…
#> $ statisticcat_desc <chr> "EXPENSE", "EXPENSE", "EXPENSE", "EXPENSE"…
#> $ unit_desc <chr> "$", "$", "$", "$", "$", "$", "$", "$", "$…
#> $ short_desc <chr> "RENT, CASH, LAND & BUILDINGS - EXPENSE, M…
#> $ domain_desc <chr> "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL…
#> $ domaincat_desc <chr> "NOT SPECIFIED", "NOT SPECIFIED", "NOT SPE…
#> $ agg_level_desc <chr> "COUNTY", "COUNTY", "COUNTY", "COUNTY", "C…
#> $ state_ansi <chr> "39", "39", "39", "39", "39", "39", "39", …
#> $ state_fips_code <chr> "39", "39", "39", "39", "39", "39", "39", …
#> $ state_alpha <chr> "OH", "OH", "OH", "OH", "OH", "OH", "OH", …
#> $ state_name <chr> "OHIO", "OHIO", "OHIO", "OHIO", "OHIO", "O…
#> $ asd_code <chr> "10", "10", "10", "10", "10", "10", "10", …
#> $ asd_desc <chr> "NORTHWEST", "NORTHWEST", "NORTHWEST", "NO…
#> $ county_ansi <chr> "003", "003", "003", "003", "003", "039", …
#> $ county_code <chr> "003", "003", "003", "003", "003", "039", …
#> $ county_name <chr> "ALLEN", "ALLEN", "ALLEN", "ALLEN", "ALLEN…
#> $ region_desc <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ zip_5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ watershed_code <chr> "00000000", "00000000", "00000000", "00000…
#> $ watershed_desc <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ congr_district_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ country_code <chr> "9000", "9000", "9000", "9000", "9000", "9…
#> $ country_name <chr> "UNITED STATES", "UNITED STATES", "UNITED …
#> $ location_desc <chr> "OHIO, NORTHWEST, ALLEN", "OHIO, NORTHWEST…
#> $ year <chr> "2017", "2012", "2007", "2002", "1997", "2…
#> $ freq_desc <chr> "ANNUAL", "ANNUAL", "ANNUAL", "ANNUAL", "A…
#> $ begin_code <chr> "00", "00", "00", "00", "00", "00", "00", …
#> $ end_code <chr> "00", "00", "00", "00", "00", "00", "00", …
#> $ reference_period_desc <chr> "YEAR", "YEAR", "YEAR", "YEAR", "YEAR", "Y…
#> $ week_ending <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ load_time <chr> "2018-02-01 00:00:00", "2012-12-31 00:00:0…
#> $ Value <chr> "10,693,000", "11,067,000", "7,001,000", "…
#> $ `CV (%)` <chr> "18.4", "9.6", NA, NA, NA, "18.4", "9.6", …
The agg_level_desc
, commodity_desc
, and state_name
variables are all the same because the query parameters were set on those values. It turns out a few other variables will be identical for the whole data.frame because they do not vary based on county level observations in Ohio: country_code
, state_alpha
, state_ansi
, and state_fips_code
.
There are a fair amount of other variables which are all the same for the entire data.frame, but these variables are not the same because of the regional aggregation variables but because we have subset the data by the commodity_desc
as “RENT”.
The various other parameters that we could have set in this query of interest are from the short_desc
parameter, which we could use the nass_param()
function to view the options for this data item:
nass_param("short_desc", commodity_desc = "RENT", agg_level_desc = "COUNTY", state_name = "OHIO")
#> [1] "RENT, CASH, CROPLAND, IRRIGATED - EXPENSE, MEASURED IN $ / ACRE"
#> [2] "RENT, CASH, CROPLAND, NON-IRRIGATED - EXPENSE, MEASURED IN $ / ACRE"
#> [3] "RENT, CASH, LAND & BUILDINGS - EXPENSE, MEASURED IN $"
#> [4] "RENT, CASH, LAND & BUILDINGS - OPERATIONS WITH EXPENSE"
#> [5] "RENT, CASH, PASTURELAND - EXPENSE, MEASURED IN $ / ACRE"
#> [6] "RENT, PER HEAD OR ANIMAL UNIT MONTH - OPERATIONS WITH EXPENSE"
Alternatively, in the previously returned query with the nass_data()
function, we could view the frequency of the different short_desc
variables to get to the same outcome but with the additional benefit of knowing the number of observations:
table(ohio_rent$short_desc)
#>
#> RENT, CASH, CROPLAND, IRRIGATED - EXPENSE, MEASURED IN $ / ACRE
#> 19
#> RENT, CASH, CROPLAND, NON-IRRIGATED - EXPENSE, MEASURED IN $ / ACRE
#> 719
#> RENT, CASH, LAND & BUILDINGS - EXPENSE, MEASURED IN $
#> 440
#> RENT, CASH, LAND & BUILDINGS - OPERATIONS WITH EXPENSE
#> 440
#> RENT, CASH, PASTURELAND - EXPENSE, MEASURED IN $ / ACRE
#> 311
#> RENT, PER HEAD OR ANIMAL UNIT MONTH - OPERATIONS WITH EXPENSE
#> 77
In Ohio, the dominant form of cash rent is for non-irrigated cropland and most counties in Ohio are surveyed in the state and have a usable value for cash rent in a year. But this is not exactly right, so we can see this by subsetting our original query for only non-irrigated cropland and look at the number of counties in each year’s observation.
non_irrigated <- ohio_rent %>%
filter(grepl("NON-IRRIGATED", short_desc))
table(non_irrigated$year) # Observation per year
#>
#> 2008 2009 2010 2011 2012 2013 2014 2016 2017
#> 64 82 79 83 82 83 83 82 81
The cash rent values begin in 2008 with a small subset of counties in Ohio and then cover the vast majority of the state in 2009 onward. As it turns out, the entire state is surveyed but some counties do not have enough observations to have a statistically relevant sample and are thus combined at the agricultural reporting district level. This can be see with a listing of all of the counties in Ohio with rent data available:
table(non_irrigated$county_name)
#>
#> ADAMS ALLEN
#> 9 9
#> ASHLAND ASHTABULA
#> 8 9
#> ATHENS AUGLAIZE
#> 3 9
#> BELMONT BROWN
#> 6 9
#> BUTLER CARROLL
#> 9 9
#> CHAMPAIGN CLARK
#> 9 9
#> CLERMONT CLINTON
#> 2 9
#> COLUMBIANA COSHOCTON
#> 9 9
#> CRAWFORD DARKE
#> 9 9
#> DEFIANCE DELAWARE
#> 8 8
#> ERIE FAIRFIELD
#> 8 9
#> FAYETTE FRANKLIN
#> 9 6
#> FULTON GALLIA
#> 9 5
#> GEAUGA GREENE
#> 8 9
#> GUERNSEY HANCOCK
#> 7 8
#> HARDIN HARRISON
#> 9 7
#> HENRY HIGHLAND
#> 9 9
#> HOCKING HOLMES
#> 8 9
#> HURON JACKSON
#> 9 4
#> JEFFERSON KNOX
#> 5 7
#> LAWRENCE LICKING
#> 7 9
#> LOGAN LORAIN
#> 9 9
#> LUCAS MADISON
#> 9 9
#> MAHONING MARION
#> 9 9
#> MEDINA MEIGS
#> 9 5
#> MERCER MIAMI
#> 9 9
#> MONROE MONTGOMERY
#> 6 9
#> MORGAN MORROW
#> 8 9
#> MUSKINGUM NOBLE
#> 9 6
#> OTHER (COMBINED) COUNTIES OTTAWA
#> 36 9
#> PAULDING PERRY
#> 9 8
#> PICKAWAY PIKE
#> 9 8
#> PORTAGE PREBLE
#> 9 9
#> PUTNAM RICHLAND
#> 9 8
#> ROSS SANDUSKY
#> 9 9
#> SCIOTO SENECA
#> 8 8
#> SHELBY STARK
#> 9 9
#> TRUMBULL TUSCARAWAS
#> 9 9
#> UNION VAN WERT
#> 9 7
#> VINTON WARREN
#> 4 7
#> WASHINGTON WAYNE
#> 9 9
#> WILLIAMS WOOD
#> 8 9
#> WYANDOT
#> 9
# nass_param("county_name", state_name = "OHIO")
The “OTHER (COMBINED) COUNTIES” value has by far and away the most observations over this time and, if one knows all of the counties in Ohio, there are a few missing counties that we need to input their values for. Each of the “OTHER (COMBINED) COUNTIES” values is for a specific agricultural reporting district:
non_irrigated %>%
filter(county_name == "OTHER (COMBINED) COUNTIES") %>%
pull(asd_code) %>%
table()
#> .
#> 10 20 30 50 60 70 80 90
#> 2 2 8 2 3 8 4 7
There are nine reporting districts in Ohio and eight of the nine have observations which combine counties for an observation – which implies some counties are missing official observations for the statistics in question. It is not necessarily the case that the counties are completely missing but they are suppressed. One method for correcting for these missing values is to replace the missing counties with the “OTHER (COMBINED) COUNTIES” category for average rent.
In order to do this, we first need a full set of all of the counties in Ohio along with their corresponding agricultural district number. There are many ways to accomplish this, but I will go about this by leveraging the 2012 Agricultural Census data which is in Quick Stats and contains a category for the number of farms in each Ohio county. The number of farms is not helpful for our data concerns but it does serve as a way to return a data.frame with 88 observations in Ohio with each uniquely corresponding to a county in Ohio.
At this point, we want to make sure that each year for our non_irrigated rent data has all 88 counties with an NA value if it does not exist. Then, we want to impute the “OTHER (COMBINED) COUNTIES” category for all of the NAs. To do this we will create the backbone of our desired observations from the farms and then fully merge this with the current data from Quick Stats. The resulting data.frame will have missing “Value” observations for the counties which do not have observations and we will impute the value from “OTHER (COMBINED) COUNTIES”. And last of all, with these data the only important aspect of these observations is that we have a county identifier, locational aspects, and the year in question. Much of the rest of the variables in the dataset are not of importance to us so we will have only keep the remaining important variables and convert them to numeric.
library("tidyr")
base_rent <- farms %>%
select(state_fips_code, county_code, county_name, asd_code, asd_desc) %>%
expand(year = unique(non_irrigated$year), nesting(state_fips_code, county_code, county_name, asd_code)) %>%
full_join(non_irrigated)
#> Joining, by = c("year", "state_fips_code", "county_code", "county_name", "asd_code")
# Correct for missing values in the "other"
base_rent <- base_rent %>%
arrange(year, asd_code, county_code) %>%
group_by(year, asd_code) %>%
mutate(Value = ifelse(is.na(Value), Value[county_code == "998"], Value)) %>%
filter(county_code != "998")
# Finally, select only the relevant variables are rename
base_rent <- base_rent %>%
select(year, state_fips_code, county_code, county_name, asd_code, rent = Value) %>%
mutate(rent = as.numeric(rent),
fips = as.numeric(paste0(state_fips_code, county_code)))
glimpse(base_rent)
#> Observations: 792
#> Variables: 7
#> Groups: year, asd_code [81]
#> $ year <chr> "2008", "2008", "2008", "2008", "2008", "2008", …
#> $ state_fips_code <chr> "39", "39", "39", "39", "39", "39", "39", "39", …
#> $ county_code <chr> "003", "039", "051", "063", "069", "095", "125",…
#> $ county_name <chr> "ALLEN", "DEFIANCE", "FULTON", "HANCOCK", "HENRY…
#> $ asd_code <chr> "10", "10", "10", "10", "10", "10", "10", "10", …
#> $ rent <dbl> 125.0, 99.0, 129.0, 104.0, 127.0, 116.0, 105.0, …
#> $ fips <dbl> 39003, 39039, 39051, 39063, 39069, 39095, 39125,…