PING  0.9
Statistical data handling and processing in production environment
ds_select

Select variables from a given dataset using the SQL-based SELECT method.

%ds_select(idsn, odsn, var=, varas=, varop=,
where=, groupby=, having=, orderby=,
_proc_=, distinct=no, all=no, ilib=WORK, olib=WORK);

Arguments

  • idsn : a dataset reference;
  • var : (option) list of fields/variables of idsn upon which the extraction is performed; default: var is empty and all variables are selected; note that (see below):
    • the operation of varop (see below) will operate on these variables only,
    • all other variables in the dataset can also be selected using the option all=YES;
  • varas : (option) list of alias (AS) to use for the variables in &var; if not empty, then should be of same length as &var; default: empty, i.e. the names of the variables are retained;
  • varop : (option) list of unary operations (e.g., min, max, ...) to run (separately) over the list of variables in &var; if not empty, then should be of same length as &var; note that the string _ID_ (see also variable G_PING_IDOP) is used for the identity operation (i.e., no operation); default: empty, i.e. _ID_ is ran over all variables;
  • where : (option) expression used to refine the selection (WHERE option); should be passed with %str or %quote; default: empty;
  • groupby : (option) list of variables of &idsn used to group data (GROUP BY option); default: empty;
  • having : (option) expression used to refine the selection (HAVING option); should be passed with %str; default: empty;
  • orderby : (option) list of variables of &idsn used to group data (ORDER BY option); default: empty;
  • distinct : (option) boolean flag (yes/no) set to use the DISTINCT option of the PROC SQL selection procedure;
  • all : (option) boolean flag (yes/no) set to keep all variables from input dataset &idsn; in practice, a * is added to the list of variables in &var into the SELECT operation; default: all=NO, i.e. only the variables in &var are present in the output dataset;
  • ilib : (option) name of the input library; by default: empty, i.e. WORK is used.

Returns

  • odsn : name of the output dataset; it will contain the selection operated on the original dataset;
  • olib : (option) name of the output library; by default: empty, i.e. WORK is also used;
  • _proc_ : (option) name of a variable containing the PROC SQL procedure; when passed, or when the debug mode is set, the operation is not actually ran, but it is returned (as a string) in this variable.

Examples

Let us consider the test dataset #35:

geo time EQ_INC20 RB050a
BE 2009 10 10
BE 2010 50 10
BE 2011 60 10
BE 2012 20 20
BE 2013 10 20
BE 2014 30 20
BE 2015 40 20
IT 2009 10 10
IT 2010 50 10
IT 2011 50 10
IT 2012 30 20
IT 2013 30 20
IT 2014 20 20
IT 2015 50 20

and run the following:

%let var= geo time EQ_INC20 RB050a;
%let varas= _ID_ _ID_ income weight;
%let varop= _ID_ _ID_ _ID_ min;
%let where= %str(EQ_INC20>20);
%let groupby= geo;
%_dstest35;
%ds_select(_dstest35, TMP, var=&var, varas=&varas, varop=&varop, distinct=no,
where=&where, groupby=&groupby);

to create the output table TMP:

geo time income weight
BE 2014 30 10
BE 2011 60 10
BE 2010 50 10
BE 2015 40 10
IT 2012 30 10
IT 2011 50 10
IT 2010 50 10
IT 2015 50 10
IT 2013 30 10

while in debug mode:

%let proc=;
%ds_select(_dstest35, TMP, var=&var, varas=&varas, varop=&varop, distinct=no,
where=&where, groupby=&groupby, _proc_=proc);

does not actually run the operation, but instead returns in proc the following string that describes the implemented procedure:

PROC SQL noprint; CREATE TABLE WORK.TMP AS SELECT geo, EQ_INC20 AS income,
min(RB050a) AS weight FROM WORK._dstest35 WHERE EQ_INC20>20 GROUP BY geo; quit;

Run macro %_example_ds_select for examples.

Notes

  1. In short, when only one variable var is passed (as well as one varas identifier and one operation varop at most), and with distinct=yes, all=yes, the macro runs the following PROC SQL procedure:
PROC SQL noprint;
CREATE TABLE &olib..&odsn AS
SELECT DISTINCT &varop(&var) AS &varas, *
FROM &ilib..&idsn
WHERE &where
GROUP BY &groupby
HAVING &having
ORDER BY &orderby
;
quit;
  1. In debug mode, or when _proc_ is set, the macro is used to defines the procedure instructions:
%let proc=;
%ds_select( ... , _proc_=proc);

so that proc is set to the procedure that launches the operation (see above), and while the actual operation is actually not ran. Further note that in the case the variable G_PING_DEBUG is not defined in your environment, and _proc_ is not set, debug mode is ignored (i.e., by default the operation is ran).

  1. For options/clause where, groupby and having, do not specify the corresponding keywords (i.e., WHERE, GROUP BY and HAVING respectively). Ibid, for the varas option, do no use the AS keyword in the list of names.
  2. Note that while clauses where and having should be passed using the %str string macro when running the process, %bquote should be preferred when running in debug mode.
  3. Remember the following differences between the HAVING and WHERE clauses:
DATA test;
a = 5; b = 6;
RUN;
PROC SQL;
CREATE TABLE test1 AS
SELECT (case when a=5 then 6
else 7 end) as a,
COUNT(*) AS dummy
FROM test
GROUP BY a
HAVING a=5;
CREATE TABLE test2 AS
SELECT (case when a=5 then 6
else 7 end) as a,
COUNT(*) AS dummy
FROM test
WHERE a=5
GROUP BY a;
quit;

create respectively test1 as:

a dummy

and test2 as:

a dummy
6 1

References

  1. Williams, C.S. (1999): "PROC SQL for DATA step die-hards".
  2. Dickstein, C. and Pass, R. (2004): "DATA step vs. PROC SQL: What's a neophyte to do?".
  3. Williams, C.S. (2008): "PROC SQL for DATA step die-hards".
  4. Marcella, S.P. and Jorgensen, G. (2009): "PROC SQL: Tips and translations for DATA step users".
  5. Bennet, J. and Ross, B. (2015): "PROC SQL for SQL die-hards".

See also

%ds_append, %ds_isempty, %ds_check, %sql_clause_by, %sql_clause_as, %obs_select, SELECT statement.