You're reading from Mastering SAS Programming for Data Warehousing An advanced programming guide to designing and managing Data Warehouses using SAS

Product type Paperback

Published in Oct 2020

Publisher Packt

ISBN-13 9781789532371

Length 494 pages

Edition 1st Edition

Tools

SAS

Concepts

Application Development

Author (1):

Monika Wahi

Table of Contents (18) Chapters

Preface

1. Section 1: Managing Data in a SAS Data Warehouse

2. Chapter 1: Using SAS in a Data Mart, Data Lake, or Data Warehouse

3. Chapter 2: Reading Big Data into SAS

4. Chapter 3: Helpful PROCs for Managing Data

5. Chapter 4: Managing ETL in SAS

6. Chapter 5: Managing Data Reporting in SAS

7. Section 2: Using SAS for Extract-Transform-Load (ETL) Protocols in a Data Warehouse

8. Chapter 6: Standardizing Coding Using SAS Arrays

9. Chapter 7: Designing and Developing ETL Code in SAS

10. Chapter 8: Using Macros to Automate ETL in SAS

11. Chapter 9: Debugging and Troubleshooting in SAS

12. Section 3: Using SAS When Serving Warehouse Data to Users

13. Chapter 10: Considering the User Needs of SAS Data Warehouses

14. Chapter 11: Connecting the SAS Data Warehouse to Other Systems

15. Chapter 12: Using the ODS for Visualization in SAS

16. Assessments

17. Other Books You May Enjoy

Using original versions of SAS

Initially, SAS data had to be input through code into memory whenever analysis code was to be run on the data. This section covers the following:

How to enter data into SAS datasets using SAS
The early PROCs developed, such as PROC PRINT and PROC MEANS
Improvements to data handling made in Base SAS

In this section, you will learn how SAS's data management processes were initially developed. The processes impact how SAS runs today.

Initial SAS data handling

As described on SAS's website (https://www.sas.com/en_us/company-information/profile.html), SAS was invented in 1966 as the Statistical Analysis System, developed under a grant from the United States (US) National Institutes of Health (NIH) to eight universities. The immediate need was to develop a computerized system that could analyze the large amount of agricultural data being collected through the US Department of Agriculture (USDA).

According to the SAS history listed on the Barr Systems website (http://www.barrsystems.com/about_us/the_company/professional_history.asp), Anthony J. Barr was in the Statistics Department of North Carolina State University and was recruited to help program SAS. He was responsible for developing the first analysis of variance (ANOVA) and regression programs in SAS and created the software for inputting and transforming data.

Barr's early development of what would later be called the data step language is still relevant today, because current data import processes in SAS use roughly the same approach. This presents both opportunities and limitations in data warehouse management.

In the early data step code, data was entered as part of the code, which can still be done today. Let's consider a modern example of using data step code to enter data in SAS by referring to the 2018 BRFSS Codebook listed in the technical requirements for this chapter. Each year, the United States Centers for Disease Control and Prevention (CDC) conducts an anonymous phone survey of approximately 450,000 residents, asking about health conditions and risk factors. This survey is called the Behavioral Risk Factor Surveillance System (BRFSS). The 2018 BRFSS Codebook describes the SAS dataset produced from the 2018 survey.

The codebook describes specifications about each variable in the dataset, including the following:

Variable name
Allowable values
Frequencies in the dataset for each value

The BRFSS Codebook is quite extensive and can be confusing for an analyst who is not already familiar with the dataset. In Chapter 3, Helpful PROCs for Managing Data, we will look closely at an example from the BRFSS Codebook. For now, let's review a codebook that is easier for the beginner to interpret. Here is an example of a codebook entry from the online codebook for the US National Health and Nutrition Examination Survey (NHANES):

Figure 1.1 – Example of a codebook entry from the US NHANES

The following table represents how three of the variables in the BRFSS Codebook – _STATE, SEX1, and _AGE80 – could be represented in three lines of data:

_STATE    SEX1    _AGE80
12        1       72
25        2       25
27        2       54

Table 1.1 – Example of three variable values for three respondents in the 2018 BRFSS dataset

Here, the state of residence of the respondent is recorded under _STATE according to its corresponding numerical Federal Information Processing Standards (FIPS) code (to decode state FIPS codes, please see the link for the FIPS state codes list in the Further reading section). SEX1 is coded as 1 for male, 2 for female, 7 for don't know/not sure, and 9 for refused. The _AGE80 variable refers to the age of the respondent imputed from other data (with ages over 80 collapsed). Using the codebook to decode the preceding data, we see the three rows represent a 72-year-old man from Florida (FL), a 25-year-old woman from Massachusetts (MA), and a 54-year-old woman from Minnesota (MN).

Let's look at an example of using data step code to create this table in SAS:

DATA THREEROWS;
    INFILE CARDS;
    INPUT _STATE SEX1 _AGE80;
    CARDS;
12 1 72
25 2 25
27 2 54
;
RUN;

Let's go through the code:

The THREEROWS dataset is created in the WORK directory of SAS. The WORK directory, simply called WORK, is the working directory for the SAS session, which means when the session is over, the data in WORK will be erased.
As is typical in SAS programming, each statement ends with a semicolon; the data lines themselves do not.
The next line, INFILE CARDS;, tells SAS that the data will be read from cards, that is, from lines placed in the code itself (CARDS can be replaced with its more modern alias, DATALINES).
When Barr designed this process, the next step would be for SAS to input punch cards that held the data. The next line, INPUT _STATE SEX1 _AGE80;, specifies that the values read from the cards will be stored in three variables: _STATE, SEX1, and _AGE80.
The next line, CARDS;, indicates that it is time for the cards to start to be read. What follows in the code is our modern representation of entering the data represented in the table into SAS using CARDS.
By the time SAS processes the CARDS statement, it already knows from the INPUT statement to expect three columns – _STATE, SEX1, and _AGE80. SAS reads the values sequentially: if a data line ran short, SAS would by default continue onto the next data line to find the remaining values. Placing several records on a single line, however, requires a trailing @@ on the INPUT statement. Reading ends at the line that contains only a semicolon.
These three variables are numeric by default; a variable is read as character only if a dollar sign ($) follows its name in the INPUT statement.
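The points above about DATALINES, the dollar sign, and reading values sequentially can be sketched as follows. This is only an illustration: the dataset names THREEROWS_CHAR and THREEROWS_ONELINE and the STATE_ABBR variable are hypothetical and do not come from the BRFSS Codebook.

```sas
/* Sketch: the same three respondents, using DATALINES instead of CARDS. */
/* STATE_ABBR is a hypothetical character variable; the $ after its name */
/* tells SAS to read it as character rather than numeric.                */
DATA THREEROWS_CHAR;
    INPUT _STATE STATE_ABBR $ SEX1 _AGE80;
    DATALINES;
12 FL 1 72
25 MA 2 25
27 MN 2 54
;
RUN;

/* The trailing @@ holds the current data line across iterations of the  */
/* data step, so several observations can be read from a single line.    */
DATA THREEROWS_ONELINE;
    INPUT _STATE SEX1 _AGE80 @@;
    DATALINES;
12 1 72 25 2 25 27 2 54
;
RUN;
```

Both steps should produce three observations. Without the trailing @@, the second step would be expected to keep only the first three values on the line.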
Note:
It is not a good idea to store actual data values in SAS code today. They can easily be lost, and if the data is private, anyone with access to the code also has access to the data. Furthermore, many of the datasets used today are extremely large, and it would not be practical to store them as data values in SAS code; instead, they might be stored in a database system such as Oracle, or in an Excel file.
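As a rough sketch of the alternative, the same INPUT statement can read from an external file rather than from lines embedded in the code. The file path, the RAWDATA fileref, and the THREEROWS_EXT dataset name below are all hypothetical, and the sketch assumes a space-delimited text file with one respondent per line.

```sas
/* Hypothetical path: replace with the location of your own text file. */
FILENAME RAWDATA '/path/to/threerows.txt';

DATA THREEROWS_EXT;
    INFILE RAWDATA;              /* read from the file instead of CARDS */
    INPUT _STATE SEX1 _AGE80;    /* same variables as before            */
RUN;
```

Keeping the values in a separate file means the code can be shared without sharing the data.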

Early SAS data handling

In September 1966, the conceptual ideas behind SAS were presented by Barr and others to the Committee on Statistical Software of the University Statisticians of Southeast Experiment Station (USSERS) at their meeting held in Athens, GA. Barr began working with others, including the current SAS CEO, James Goodnight, on developing the first worldwide release of SAS.

Note:

The first worldwide release of SAS in 1972 consisted of 37,000 lines of code, 65% written by Barr, and 32% written by Goodnight.

Improvements implemented in the 1972 worldwide release of SAS focused on procedures known as PROCs. Procedures are applied to SAS datasets. Some PROCs are for data editing and handling, but most are focused on data visualization and analysis, with the data handling typically done using data steps. Barr developed some basic PROCs still used today to assist in data handling, including PROC SORT and PROC PRINT.

Let's look at an example of PROC SORT and PROC PRINT. In the data we entered earlier, the rows were arranged in order of the value in the _STATE variable:

If we wanted to sort the dataset in order of the respondent's age, or _AGE80, we could use PROC SORT with a BY _AGE80 statement.
Following that with a PROC PRINT would then print the resulting dataset to the screen.

This is shown in the following code:

/* Sort THREEROWS in place by ascending age */
PROC SORT DATA=THREEROWS;
    BY _AGE80;
RUN;
/* Print the sorted dataset */
PROC PRINT DATA=THREEROWS;
RUN;
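A variation on this example is sketched below. It writes the sorted rows to a new dataset so that THREEROWS itself is left unchanged, and it uses PROC FORMAT so that the printed codes appear as labels. The dataset name THREEROWS_BY_AGE and the format names SEXF and STATEF are hypothetical; the labels are taken from the codes described earlier in this section, and STATEF covers only the three states in our example.

```sas
/* Hypothetical formats that decode the numeric codes for display only. */
PROC FORMAT;
    VALUE SEXF   1 = 'Male'
                 2 = 'Female'
                 7 = "Don't know/Not sure"
                 9 = 'Refused';
    VALUE STATEF 12 = 'FL'
                 25 = 'MA'
                 27 = 'MN';
RUN;

/* OUT= writes the sorted rows to a new dataset; DESCENDING puts the    */
/* oldest respondent first.                                             */
PROC SORT DATA=THREEROWS OUT=THREEROWS_BY_AGE;
    BY DESCENDING _AGE80;
RUN;

/* NOOBS suppresses the observation number column in the listing.       */
PROC PRINT DATA=THREEROWS_BY_AGE NOOBS;
    FORMAT SEX1 SEXF. _STATE STATEF.;
RUN;
```

The formats change only how the values are displayed; the stored values remain numeric.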

Barr developed some analysis PROCs, such as PROC ANOVA, which conducts an ANOVA, while Goodnight developed PROCs aimed principally at analysis, such as PROC CORR for correlations and PROC MEANS for calculating means. The ideal situation at the time for data handling was to have the data already stored on the cards in essentially the format needed for analysis. However, data step code was available for editing the data.
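To illustrate the analysis PROCs mentioned here, the following sketch runs PROC MEANS and PROC CORR on the THREEROWS dataset created earlier. With only three rows, the results are not statistically meaningful; the code is meant only to show the general shape of the calls.

```sas
/* Summary statistics for age: count, mean, minimum, and maximum. */
PROC MEANS DATA=THREEROWS N MEAN MIN MAX;
    VAR _AGE80;
RUN;

/* Correlation between age and the numeric sex code. */
PROC CORR DATA=THREEROWS;
    VAR _AGE80 SEX1;
RUN;
```

For the ages 72, 25, and 54, PROC MEANS should report a mean of about 50.3.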

At this time, NIH discontinued funding the project, but the consortium of universities that had worked on the project agreed to provide funding support, allowing the programmers to continue building SAS. Barr, Goodnight, and others continued to develop the software, adding mainly statistical functions rather than data management functions, and released a 1976 version. For the 1976 version, Barr rewrote the internals of SAS, including the following:

Data management functions
Report-writing functions
The compiler

This was the first big rewrite of SAS's processing functions.

In 1976, SAS Institute, Inc. was incorporated, with ownership split between Barr, Goodnight, and two others. By this point, SAS already had a sizable user base:

Over 100 organizations including pharmaceutical companies, insurance companies, banks, governmental entities, and members of the academic community were using SAS.

More than 300 people attended the first SAS users conference in 1976.

Reflecting on this short history, it is understandable that even today, SAS maintains the reputation of being the only statistical software that can comprehensively handle big data. While in some regards this statement remains true, it is also necessary to revisit a more subtle point, which is that SAS was initially developed for data analysis – not for data storage. Even with improvements, SAS data handling is still limited by some of the features originally developed in this early era.

SAS data handling improvements

Both the SAS programs and the data SAS analyzed were initially stored on punch cards. These were physical cards with hole punches in them to indicate instructions to the computer. The following photograph shows a real punch card that was used for an IBM 1130 program:

Figure 1.2 – This card contains a self-loading IBM 1130 program that copies the deck of cards placed after it in the input hopper. Photograph by Arnold Reinhold, CC BY-SA 2.5 (https://creativecommons.org/licenses/by-sa/2.5/deed.en)

In his 2005 report titled Programming with Punch Cards, Dale Fisk explains how creating a set of punch cards to run a computer program was a multi-step, labor- and time-intensive process:

First, cards had to be punched by hand.
Next, the program punched into the cards had to be compiled through a computer, which would produce a printed list of errors if there were any.
If the program compiled, the computer would print a set of cards with the compiled program.
This new set of cards would be the ones used to launch the program.

SAS was the first application that could feasibly handle running programs to analyze large datasets using punch cards. The positive result of the development of early SAS programs was the ability to use these punch cards to run complex regressions that could never have been attempted before.

But punch cards also created challenges. The foundational SAS component, called Base SAS, was about 300,000 lines of code. This program was stored in 150 boxes of cards that would stand 40 feet high. These boxes were separate from the boxes of data cards that SAS would be used to analyze.

This meant that card storage was an issue. A lot of space was required already for the computers themselves, which could take up entire rooms. In addition, the cards were unwieldy and required their own set of error handling procedures, including the LOST CARD error that can still be displayed in SAS today under particular circumstances when there is an error reading in data.
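One circumstance in which the LOST CARD message can appear is sketched below, assuming default list input: the last data line is missing a value, so SAS looks for it on a following line that does not exist. The dataset names LOSTCARD_DEMO and MISSOVER_DEMO are hypothetical.

```sas
/* The second data line has only two values. SAS moves to the next    */
/* line to find _AGE80, runs out of data, and notes a lost card in    */
/* the log; the incomplete observation is not written to the dataset. */
DATA LOSTCARD_DEMO;
    INPUT _STATE SEX1 _AGE80;
    DATALINES;
12 1 72
25 2
;
RUN;

/* With MISSOVER, SAS does not move to the next line; _AGE80 is set   */
/* to missing for the short record and the observation is kept.       */
DATA MISSOVER_DEMO;
    INFILE DATALINES MISSOVER;
    INPUT _STATE SEX1 _AGE80;
    DATALINES;
12 1 72
25 2
;
RUN;
```

The first step should end with one observation and the second with two, which makes the message a useful hint that a data line is shorter than the INPUT statement expects.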

Nevertheless, SAS continued to reach out and recruit new customers. From the beginning, SAS has always prided itself on its customer service. As of 1978, there were 600 SAS customer sites, but only 21 employees. The climate was that of everyone pitching in to help fill customer needs, and Goodnight was known for recognizing the value of employees to the company.
