Removing duplicates is a common task when working with data.
PROC SORT has several options that change its behavior beyond simple sorting.
Two of these options, ‘nodup’ and ‘nodupkey’, remove duplicates.
In this article, you will learn the difference between the ‘nodup’ and ‘nodupkey’ PROC SORT options.
The Difference Between nodupkey and nodup Options When Using PROC SORT in SAS
PROC SORT is mostly used to sort data in SAS, but with the right options it can also remove duplicates.
The two options for this are ‘nodup’ and ‘nodupkey’.
‘nodup’ compares entire observations and removes an observation when every variable matches the previous observation in the sorted output, while ‘nodupkey’ removes observations with duplicate BY values. In other words, ‘nodupkey’ removes duplicates by key variables.
The BY statement still controls the sort order for both options, but only ‘nodupkey’ uses the BY variables to decide what counts as a duplicate.
Let’s take a look at an example.
Let’s say we have the following dataset, which contains several duplicate rows.
data example;
input a b;
datalines;
1 2
1 2
1 2
2 6
2 6
2 6
2 8
;
run;
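One way to confirm where the duplicates are before removing anything is to count each combination of ‘a’ and ‘b’. This is only a quick sketch using PROC FREQ, not a required step:
proc freq data=example;
/* list requests one output row per combination of a and b, with its count */
tables a*b / list;
run;
For this data, the combination 1 2 appears three times, 2 6 appears three times, and 2 8 appears once.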
Let’s take a look at the behavior when using the ‘nodup’ option.
data example;
input a b;
datalines;
1 2
1 2
1 2
2 6
2 6
2 6
2 8
;
run;
proc sort data=example nodup;
by a;
run;
/* example After PROC SORT */
a b
1 2
2 6
2 8
As you can see, the duplicate rows were removed, and the row 2 8 was kept because it differs from 2 6 in column ‘b’. Now, let’s see what happens when we use ‘nodupkey’.
data example;
input a b;
datalines;
1 2
1 2
1 2
2 6
2 6
2 6
2 8
;
run;
proc sort data=example nodupkey;
by a;
run;
/* example After PROC SORT */
a b
1 2
2 6
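As you can see, more rows were removed because only the column ‘a’ is compared. One observation is kept for each value of ‘a’, so the row 2 8 is dropped even though it is not a full duplicate of 2 6.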
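One caveat about ‘nodup’ is worth a small sketch. SAS compares each observation only with the previous one written to the output, so identical rows are removed only when the sort places them next to each other. The dataset name ‘scattered’ and the output names ‘by_a’ and ‘by_all’ below are made up for illustration.
data scattered;
input a b;
datalines;
1 2
1 3
1 2
;
run;
/* Sorted by a alone, the two 1 2 rows stay separated by 1 3, */
/* so nodup should keep all three rows */
proc sort data=scattered out=by_a nodup;
by a;
run;
/* Sorting by every variable puts identical rows side by side, */
/* so nodup can drop the repeated 1 2 row */
proc sort data=scattered out=by_all nodup;
by _all_;
run;
If you want ‘nodup’ to remove every fully identical row, sorting by all variables with ‘by _all_’ is the safer choice.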
Hopefully this article has been useful for you to learn the difference between ‘nodup’ and ‘nodupkey’ in SAS when removing duplicates.