Chapter 1 Introduction
1.1 Introduction to SAS Enterprise Miner
SAS Enterprise Miner
The SAS Enterprise Miner interface simplifies many common tasks associated with applied analysis. It offers secure analysis management and provides a wide variety of tools with a consistent graphical interface. You can customize it by incorporating your choice of analysis methods and tools.
SAS Enterprise Miner – Interface Tour: Menu bar and shortcut buttons
The interface window is divided into several functional components. The menu bar and corresponding shortcut buttons perform the usual windows tasks, in addition to starting, stopping, and reviewing analyses.
SAS Enterprise Miner – Interface Tour: Project panel
The Project panel manages and views data sources, diagrams, results, and project users.
SAS Enterprise Miner – Interface Tour: Properties panel
The Properties panel enables you to view and edit the settings of data sources, diagrams, nodes, results, and users.
SAS Enterprise Miner – Interface Tour: Help panel
The Help panel displays a short description of the property that you select in the Properties panel. Extended help can be found in the Help Topics selection from the Help main menu.
SAS Enterprise Miner – Interface Tour: Diagram workspace
In the diagram workspace, process flow diagrams are built, edited, and run. The workspace is where you graphically sequence the tools that you use to analyze your data and generate reports.
SAS Enterprise Miner – Interface Tour: Process flow
The diagram workspace contains one or more process flows. A process flow starts with a data source and sequentially applies SAS Enterprise Miner tools to complete your analytic objective.
SAS Enterprise Miner – Interface Tour: Node
A process flow contains several nodes. Nodes are SAS Enterprise Miner tools connected by arrows to show the direction of information flow in an analysis.
SAS Enterprise Miner – Interface Tour: SEMMA tools palette
The SAS Enterprise Miner tools available to your analysis are contained in the tools palette. The tools palette is arranged according to a process for data mining, SEMMA. SEMMA is an acronym for the following:
- Sample: You sample the data by creating one or more data tables. The samples should be large enough to contain the significant information, but small enough to process.
- Explore: You explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas.
Data manipulation is an important part of the data mining process. Filtering out unwanted observations and removing inaccurate or skewed variables helps to ensure that the analysis is accurate. SAS® Enterprise Miner™ includes two nodes created specifically for these purposes.
This tip focuses on these two nodes and how they can be used:
- Drop Node
- Filter Node
The Drop Node
The Drop Node can be used to remove any unnecessary variables from the Enterprise Miner data sets. Any of the following role types can be dropped from scored data sets:
- Assess
- Classification
- Frequency
- Hidden
- Input
- Rejected
- Residual
- Target
The Drop Node can be used within decision trees to trim the size of the data sets and metadata during the tree analysis.
The Drop Node can be found within the ribbon under the Modify tab.
The Drop Node can be dragged on to a SAS Enterprise Miner diagram and joined using an arrow to direct the flow of the data through the system:
The Drop Node allows you to specify the variables that you wish to remove from the SAS data set. To view the options available for the Drop Node, click on the Drop Node in the diagram and its properties will be displayed within the left pane.
By default, the ‘Drop from Tables’ attribute is set to ‘No’. This means that any variables that are selected to be dropped are removed from the exported metadata only. If this value is set to ‘Yes’, the variables are also dropped from the exported tables, and the node creates data sets instead of views.
Within the ‘Drop Selection Options’ you can choose the roles of the variables that you would like to drop from the data analysis. The available roles are listed below:
- Assess
- Classification
- Frequency
- Hidden *
- Input
- Predict
- Rejected *
- Residual
- Target
- Other
* Variables that have a role of Hidden or Rejected are dropped by default.
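The role-based selection described above can be sketched in plain Python. This is only an illustration of the logic, not how SAS Enterprise Miner implements the node: the function name `drop_by_role`, the `drop_from_tables` flag, and the sample variables and roles (including the assumption that logSalary has a role of Rejected) are all hypothetical.

```python
# Illustrative sketch only: mimics role-based dropping in plain Python.
# The variable names and roles below are hypothetical sample metadata.

DEFAULT_DROP_ROLES = {"Hidden", "Rejected"}  # roles dropped by default, per the text above


def drop_by_role(roles, rows, drop_roles=DEFAULT_DROP_ROLES, drop_from_tables=False):
    """Return (exported_roles, exported_rows) after dropping variables by role.

    roles: dict mapping variable name -> role
    rows: list of dicts, one per observation
    drop_from_tables: when False, only the metadata is trimmed, which mirrors
        the 'Drop from Tables' = 'No' default described in the text.
    """
    dropped = {name for name, role in roles.items() if role in drop_roles}
    exported_roles = {n: r for n, r in roles.items() if n not in dropped}
    if drop_from_tables:
        # Physically remove the dropped columns from every observation.
        rows = [{k: v for k, v in row.items() if k not in dropped} for row in rows]
    return exported_roles, rows


# Hypothetical metadata and a single hypothetical observation.
roles = {"name": "Input", "no_hits": "Input", "salary": "Target", "logSalary": "Rejected"}
rows = [{"name": "A", "no_hits": 81, "salary": 475.0, "logSalary": 6.16}]

meta_only, same_rows = drop_by_role(roles, rows)
meta_and_data, trimmed_rows = drop_by_role(roles, rows, drop_from_tables=True)

print(sorted(meta_only))        # variables left in the exported metadata
print(sorted(trimmed_rows[0]))  # columns left in the exported table
```

With the default flag the observation still carries the logSalary column and only the metadata changes; with `drop_from_tables=True` the column is removed from the data as well.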
Within the Baseball data set the following roles have been set. On running the Drop Node with its default settings, we would expect the logSalary variable to be dropped from the data set.
To run the Drop Node, right-click on the last node in the sequence and select Run. A green tick indicates that the node has run successfully:
On running the flow with the default settings, the following output log shows that one interval variable was discovered that had a role of rejected. This variable was removed from the data set.
The Filter Node
The Filter Node enables you to apply a filter to the training data set in order to exclude outliers or other observations that you do not want to include in your data mining analysis. Outliers can greatly affect modelling results and, subsequently, the accuracy and reliability of trained models.
Within SAS Enterprise Miner, the Filter Node can be found in the ribbon within the Sample tab.
The Filter Node can be dragged on to a SAS Enterprise Miner diagram and joined using an arrow to direct the flow of the data through the system:
The Filter Node can be used to remove missing values, to use normalised values, or to customise the filtering method for both class and interval variables.
The ‘Export Table’ option allows you to specify which table to export after training. This value can be set to one of the following:
- Filtered: The default option, this allows the filtered data to be passed through as a view for further processing.
- Excluded: This passes through any filtered out data as a view for further processing.
- All: This passes all of the data through as a view and creates an indicator variable to identify any filtered records.
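The three ‘Export Table’ settings can be pictured with a small Python sketch. It is a simplified illustration under stated assumptions: the function `export_table`, the predicate `keep`, and the indicator name `filtered_out` are hypothetical, and the real node exports views rather than Python lists.

```python
def export_table(rows, keep, mode="Filtered", flag_name="filtered_out"):
    """Return the observations that one 'Export Table' setting would pass on.

    rows: list of dicts, one per observation
    keep: function returning True for observations that pass the filter
    flag_name: hypothetical name for the indicator variable added by 'All'
    """
    if mode == "Filtered":
        return [r for r in rows if keep(r)]
    if mode == "Excluded":
        return [r for r in rows if not keep(r)]
    if mode == "All":
        # Pass every observation through and mark the filtered-out ones with 1.
        return [{**r, flag_name: 0 if keep(r) else 1} for r in rows]
    raise ValueError(f"unknown mode: {mode!r}")


def keep(row):
    """Hypothetical filter rule: keep observations with x below 50."""
    return row["x"] < 50


rows = [{"x": 1}, {"x": 2}, {"x": 100}]

filtered = export_table(rows, keep, "Filtered")
excluded = export_table(rows, keep, "Excluded")
everything = export_table(rows, keep, "All")
```

The ‘All’ setting is the only one that changes the shape of each record, because it has to carry the indicator alongside the original values.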
The ‘Tables to Filter’ option allows you to specify whether only the training data set is filtered or whether all data sets are filtered.
The ‘Distribution Data Sets’ option allows you to specify whether the data sets used for interactive filtering should be created at training time. These data sets are used for histograms and bar charts, which you may use in further analysis.
Class variables, by default, are filtered by Rare Values (Percentage) with a minimum percentage cutoff of 0.01%. This excludes observations whose class variable value occurs in less than 0.01% of the data. The default also keeps any normalised or missing class variable values.
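A rare-value percentage filter of this kind can be sketched as follows. This is an approximation for illustration, not the node's actual algorithm: the function `rare_value_filter`, its parameters, and the sample variable `league` are hypothetical, and the sketch assumes that percentages are computed over all observations, including those with missing values.

```python
from collections import Counter


def rare_value_filter(rows, variable, min_pct=0.01, keep_missing=True):
    """Split rows into (kept, excluded) using a rare-value percentage cutoff.

    A value is treated as rare when it occurs in less than min_pct percent
    of all observations. Missing values (None) are kept when keep_missing
    is True, which mirrors the default described in the text.
    """
    counts = Counter(r[variable] for r in rows if r[variable] is not None)
    total = len(rows)  # assumption: the denominator includes missing values
    kept, excluded = [], []
    for r in rows:
        value = r[variable]
        if value is None:
            (kept if keep_missing else excluded).append(r)
        elif 100.0 * counts[value] / total < min_pct:
            excluded.append(r)
        else:
            kept.append(r)
    return kept, excluded


# Hypothetical data: one very rare level "B" and one missing value.
rows = [{"league": "A"} for _ in range(19999)]
rows.append({"league": "B"})
rows.append({"league": None})

kept, excluded = rare_value_filter(rows, "league")
```

Here level "B" appears in roughly 0.005% of the observations, which is below the 0.01% cutoff, so that single observation is the only one excluded.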
Interval variables are filtered using Standard Deviations from the Mean, with missing values also being kept.
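The standard-deviation rule for interval variables can be sketched in Python as well. The cutoff of three standard deviations, the use of the sample standard deviation, and the names `std_dev_limits` and `filter_interval` are assumptions made for this illustration; check the node's properties for the cutoff it actually applies.

```python
from statistics import mean, stdev


def std_dev_limits(values, n_std=3.0):
    """Return (lower, upper) limits at n_std standard deviations from the mean.

    Missing values (None) are ignored when computing the statistics.
    n_std=3.0 is an assumed cutoff used only for this sketch.
    """
    present = [v for v in values if v is not None]
    centre, spread = mean(present), stdev(present)  # stdev is the sample std dev
    return centre - n_std * spread, centre + n_std * spread


def filter_interval(values, n_std=3.0, keep_missing=True):
    """Keep values inside the limits; keep missing values when keep_missing is True."""
    lower, upper = std_dev_limits(values, n_std)
    kept = []
    for v in values:
        if v is None:
            if keep_missing:
                kept.append(v)
        elif lower <= v <= upper:
            kept.append(v)
    return kept


# Hypothetical data: twenty ordinary values, one extreme value, one missing value.
values = [10.0] * 20 + [1000.0, None]

lower, upper = std_dev_limits(values)
kept_values = filter_interval(values)
```

In this sample the extreme value lies above the upper limit and is excluded, while the missing value is passed through unchanged.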
To run the Filter Node, right-click on the last node in the sequence and select Run. A green tick indicates that the node has run successfully:
Running the Filter Node with the default settings excluded 44 observations from the training data set.
The class variable values that have been removed are as below:
The limits that were used for the interval variables are also displayed in the results window: