Download datasets from the Eurostat database and return them in a standardized format.

get_eurostat_bulk(
  id,
  cache = TRUE,
  update_cache = FALSE,
  cache_dir = NULL,
  compress_file = TRUE,
  stringsAsFactors = TRUE,
  select_freq = NULL,
  keep_flags = FALSE,
  cflags = FALSE,
  check_toc = FALSE,
  verbose = FALSE,
  ...
)

Arguments

id

a code name for the dataset of interest. See search_eurostat_toc for details on how to get an id.

cache

a logical value indicating whether to cache the data. Default is TRUE.

update_cache

a logical value indicating whether to update the cache. Default is FALSE. Can also be set with options(restatapi_update=TRUE).

cache_dir

a path to a cache directory. The default NULL uses memory as the cache. If the cache_dir directory does not exist, the data is saved in the 'restatapi' directory under the temporary directory returned by tempdir(). The directory can also be set with options(restatapi_cache_dir=...).

compress_file

a logical value indicating whether to compress the RDS file when caching. Default is TRUE.

stringsAsFactors

a logical value with the default TRUE. In this case the columns are converted to factors. If FALSE, the strings are returned as characters.

select_freq

a character symbol for a time frequency, used when a dataset has multiple time frequencies. Possible values are: A = annual, S = semi-annual, H = half-year, Q = quarterly, M = monthly, W = weekly, D = daily. The default is NULL, as most datasets have only one time frequency. If there are multiple frequencies and select_freq=NULL, only the most common frequency is kept. If all the frequencies are needed, the get_eurostat_raw function can be used.

keep_flags

a logical value indicating whether the observation status (flags), e.g. "confidential", "provisional", etc., should be kept in a separate column or removed. Default is FALSE. For flag values see: https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/codelist/ESTAT/OBS_STATUS/?compressed=false&format=TSV&lang=en.

cflags

a logical value indicating whether missing observations with the flag 'c' ("confidential") should be kept. Default is FALSE, in which case these observations are dropped from the dataset. If this parameter is TRUE, all the flags and the suppressed observations with missing values are kept, and keep_flags is set to TRUE regardless of the value provided.

check_toc

a logical value indicating whether to check the provided id in the Table of Contents (TOC). The default value is FALSE, in which case the base URL for the download link is retrieved from the configuration file. If the value is TRUE, the TOC is downloaded and the id is checked in it. If the id is found there, the download link is retrieved from the TOC.

verbose

a logical value with default FALSE, so detailed messages (for debugging) will not be printed. Can also be set with options(restatapi_verbose=TRUE).

...

other parameter(s) to pass on to the load_cfg function

Value

a data.table with the following columns:

dimension names    One column for each dimension in the data
time               A column for the time dimension
values             A column for numerical values
flags              A column for flags if keep_flags=TRUE or cflags=TRUE; otherwise this column is not included in the data table

The data.table does not include all missing values. An observation is dropped if both the value and the flag are missing for a particular time period.
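The dropping rule described above can be illustrated on a small toy table. This is only a sketch in base R under stated assumptions: the data frame `obs` is hypothetical (not downloaded from Eurostat), the flag letters are placeholders, and treating an empty string as a missing flag is an assumption rather than the package's documented behaviour.

```r
# Hypothetical toy observations; nothing here comes from Eurostat.
obs <- data.frame(
  geo    = c("CY", "CY", "EE", "EE"),
  time   = c("1995", "1996", "1995", "1996"),
  values = c(139.0, NA, NA, 706.6),
  flags  = c("", "u", "", "p"),
  stringsAsFactors = FALSE
)

# Assumption: a flag counts as missing when it is NA or an empty string.
flag_missing <- is.na(obs$flags) | obs$flags == ""

# Drop a row only when both the value and the flag are missing.
kept <- obs[!(is.na(obs$values) & flag_missing), ]

print(kept)
```

In this sketch the row for "EE" in 1995 is removed because it has neither a value nor a flag, while the row for "CY" in 1996 stays because its flag still carries information.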

Details

Datasets are downloaded from the Eurostat bulk download facility in TSV format, because this means a smaller file has to be downloaded and processed. If there is more than one frequency, the dataset is filtered for a single time frequency. If no frequency is selected and there are multiple frequencies in the dataset, the most common frequency is used.
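The frequency selection described above can be sketched in base R. This is not the package's internal code: the helper `pick_freq`, the toy data frame `raw` and its `freq` column are hypothetical names introduced only for illustration, and how ties between equally common frequencies are resolved is an assumption (here the first one in alphabetical order wins).

```r
# Hypothetical toy data with an explicit frequency column.
raw <- data.frame(
  freq   = c("M", "M", "M", "A", "Q"),
  time   = c("2005-01", "2005-02", "2005-03", "2005", "2005-Q1"),
  values = c(34, 85, 62, 540, 181),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep one frequency and drop the frequency column.
pick_freq <- function(data, select_freq = NULL) {
  if (is.null(select_freq)) {
    # Count the rows per frequency and take the most common one.
    counts <- table(data$freq)
    select_freq <- names(counts)[which.max(counts)]
  }
  data[data$freq == select_freq, setdiff(names(data), "freq"), drop = FALSE]
}

monthly <- pick_freq(raw)                    # no frequency given: most common is used
annual  <- pick_freq(raw, select_freq = "A") # explicit choice
```

With the toy data, `pick_freq(raw)` keeps the three monthly rows, which mirrors the message printed for "avia_par_ee" when 'M' is selected as the most common frequency.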

Compared to the output of the get_eurostat_raw function, the frequency (FREQ) and time format (TIME_FORMAT) columns are not included in the bulk data, and the columns for the time period, observation values and status have standardised names: "time", "values" and "flags", regardless of whether the data was previously downloaded in SDMX or TSV format.

By default, all datasets are cached, as they are often rather large. The datasets are cached in memory (default) or can be stored in a temporary directory if cache_dir or options(restatapi_cache_dir=...) is defined. The cache can be emptied with clean_restatapi_cache.

The id is a value from the code column of the table of contents (get_eurostat_toc) and can be searched for with the search_eurostat_toc function. The id value can be retrieved from the Eurostat database as well. The Eurostat database gives the code in parentheses after every dataset in the Data Navigation Tree.

Examples

# \dontshow{
if (parallel::detectCores()<=2){
   options(restatapi_cores=1)
}else{
   options(restatapi_cores=2)
}    
# }
# \donttest{
if (!(grepl("amzn|-aws|-azure ",Sys.info()['release']))) options(timeout=2)
head(get_eurostat_bulk("agr_r_milkpr",keep_flags=TRUE))
#>    milkitem dairyprod    geo   time values  flags
#>      <fctr>    <fctr> <fctr> <fctr>  <num> <fctr>
#> 1:      PRO    D1110A     CY   1995  139.0       
#> 2:      PRO    D1110A    CY0   1995  139.0       
#> 3:      PRO    D1110A   CY00   1995  139.0       
#> 4:      PRO    D1110A     EE   1995  706.6       
#> 5:      PRO    D1110A    EE0   1995  706.6       
#> 6:      PRO    D1110A   EE00   1995  706.6       
options(restatapi_update=TRUE)
head(get_eurostat_bulk("avia_par_ee",check_toc=TRUE))
#> There are multiple frequencies in the dataset. The 'M' is selected as it is the most common frequency.
#>      unit tra_meas         airp_pr    time values
#>    <fctr>   <fctr>          <fctr>  <fctr>  <num>
#> 1: FLIGHT  CAF_PAS EE_EETN_BE_EBBR 2005-01     34
#> 2: FLIGHT  CAF_PAS EE_EETN_CZ_LKPR 2005-01     85
#> 3: FLIGHT  CAF_PAS EE_EETN_DE_EDDB 2005-01     62
#> 4: FLIGHT  CAF_PAS EE_EETN_DE_EDDF 2005-01    114
#> 5: FLIGHT  CAF_PAS EE_EETN_DE_EDDH 2005-01     26
#> 6: FLIGHT  CAF_PAS EE_EETN_DE_EDDT 2005-01     26
head(get_eurostat_bulk("avia_par_ee",select_freq="A",verbose=TRUE))
#> 
#> get_eurostat_bulk - API version:2
#> get_eurostat_bulk - class of id, cache, update_cache, cache_dir, compress_file, stringsAsFactors, keep_flags, check_toc, melt, verbose:
#> character - logical -logical - NULL - logical - logical - logical - logical - logical - logical
#> 
#> get_eurostat_raw - API version:2
#> get_eurostat_raw - bulk url: https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/avia_par_ee?format=TSV&compressed=true
#> get_eurostat_raw - class(raw): data.tabledata.frame
#> get_eurostat_raw - caching in raw: FALSE
#> get_eurostat_raw - local filter: FALSE
#> get_eurostat_raw - called from: restatapi::get_eurostat_rawidtxtcacheupdate_cachecache_dircompress_filestringsAsFactorskeep_flagscheck_tocTRUEverbose
#> get_eurostat_raw - get_eurostat_raw in sys.call(): FALSE
#>      unit tra_meas         airp_pr   time values
#>    <fctr>   <fctr>          <fctr> <fctr>  <num>
#> 1: FLIGHT  CAF_PAS EE_EETN_AT_LOWW   2001    540
#> 2: FLIGHT  CAF_PAS EE_EETN_DE_EDDF   2001    415
#> 3: FLIGHT  CAF_PAS EE_EETN_DK_EKCH   2001   1829
#> 4: FLIGHT  CAF_PAS EE_EETN_FI_EFHK   2001   4464
#> 5: FLIGHT  CAF_PAS EE_EETN_LT_EYVI   2001   1083
#> 6: FLIGHT  CAF_PAS EE_EETN_RU_UUEE   2001    869
options(restatapi_update=FALSE)
head(get_eurostat_bulk("agr_r_milkpr",cache_dir=tempdir(),compress_file=FALSE,verbose=TRUE))
#> 
#> get_eurostat_bulk - API version:2
#> get_eurostat_bulk - class of id, cache, update_cache, cache_dir, compress_file, stringsAsFactors, keep_flags, check_toc, melt, verbose:
#> character - logical -logical - character - logical - logical - logical - logical - logical - logical
#> 
#> get_eurostat_raw - API version:2
#> get_eurostat_raw - bulk url: https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/agr_r_milkpr?format=TSV&compressed=true
#> get_eurostat_raw - class(raw): data.tabledata.frame
#> get_eurostat_raw - caching in raw: FALSE
#> get_eurostat_raw - local filter: FALSE
#> get_eurostat_raw - called from: restatapi::get_eurostat_rawidtxtcacheupdate_cachecache_dircompress_filestringsAsFactorskeep_flagscheck_tocTRUEverbose
#> get_eurostat_raw - get_eurostat_raw in sys.call(): FALSE
#>    milkitem dairyprod    geo   time values
#>      <fctr>    <fctr> <fctr> <fctr>  <num>
#> 1:      PRO    D1110A     CY   1995  139.0
#> 2:      PRO    D1110A    CY0   1995  139.0
#> 3:      PRO    D1110A   CY00   1995  139.0
#> 4:      PRO    D1110A     EE   1995  706.6
#> 5:      PRO    D1110A    EE0   1995  706.6
#> 6:      PRO    D1110A   EE00   1995  706.6
clean_restatapi_cache(cache_dir=tempdir(),verbose=TRUE)
#> 
#> clean_restatapi_cache - All objects (outside of 'cfg', 'rav', 'cc' and 'dmethod') are removed from '.restatapi_env'.
#> clean_restatapi_cache - The cache folder /tmp/Rtmp2iECG6/restatapi is empty.
options(timeout=60)
# }