SPSS WORKSHOP

Version: 1.0
Last Modification Date: 10/October/2012
Owner: ISV & Developer Relations, La Gaude
Emmanuel Genard ([email protected])
Benoit Barranco ([email protected])
Armand Ruiz ([email protected])

Table of Contents

Lab 1: Create an SPSS Model to Predict Insurance Risk
1. Load the historical data from a flat file
2. Define what we are trying to predict
3. Partitioning our data
4. Finding the best model
5. Building the model and checking its accuracy
6. Applying the model to new policy applications

Lab 2: Retail Market Basket analysis
1. Understand the source data
2. Prepare the data
3. Know our customers better

Lab 3: Text Analytics
1. Text Mining Nodes
2. A Typical Text Mining Session
3. Import Data
4. Selecting a Set of Categories and Resources
5. Extracted Results Pane
6. Data Pane
7. Resource Editor
8. Categorization
9. Generating a Text Mining Model

Lab 1: Create an SPSS Model to Predict Insurance Risk

Lab Objectives: In this lab we will use SPSS Modeler and historical data to build a model that predicts which new homeowner's insurance applications will be least risky. We will then apply the model to new applications so that the least risky ones can be offered a policy discount.
You will complete the following tasks:
i) Loading data into SPSS Modeler from a flat file.
ii) Defining the target of the model and excluding fields that take away from the model's accuracy.
iii) Partitioning data into training and testing sets.
iv) Automatically finding the best algorithm for the model.
v) Comparing the results of different algorithms running against the same data.
vi) Checking the accuracy of a model.
vii) Exporting the results of a model.

Lab Duration: 40 minutes

1. Load the historical data from a flat file

In this section you'll load historical data on homeowner policies and their history of claims from a comma-separated flat file. The data include both policies where the policyholder has made claims and policies without claims. The goal of the model is to predict which new policy applications will be least risky (i.e. result in no claims) so we can offer those applicants a policy discount.

Step 1 Start SPSS Modeler from the Start menu (Start=>All Programs=>IBM SPSS Modeler 15.0=>IBM SPSS Modeler 15.0).
Step 2 Click on Create a new stream (top right hand side of the dialog).
Step 3 Click the Sources tab and drag a Var. File node onto the canvas.

Figure 1 Canvas after Var. File node added

Step 4 Double click on the Var. File node you just added to bring up its properties.
Step 5 Click on the button with three dots to locate the source file.

Figure 2 Locate source file

Step 6 Navigate to the folder C:\SPSS Workshop\Labs\Lab1 and select the file claimed.txt
Step 7 Click Open.
Step 8 Click Preview to see a selection of the data in the file.

Figure 3 Preview of data in source file

Step 9 Click OK to close the Preview window.
Step 10 Click OK to close the Var. File properties.

2. Define what you're trying to predict

In this section you'll indicate which field you're trying to predict and which fields will be considered in developing a predictive model. The roles can actually be defined in the Var. File node itself; using a separate Type node is simply recommended because the same Var. File node can then be used in multiple branches, or copied and pasted on the same canvas, with each instance typed differently through its own Type node.

Step 1 Click the Field Ops tab and drag a Type node onto the canvas to the right of the Var. File node you just added.

Figure 4 Dragging a Type node onto the canvas

Step 2 Select the Var. File node you added previously (it should now be labeled claimed.txt), right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 3 Click the Type node to complete the connection from the Var. File node. Your canvas should now look like Figure 5.

Figure 5 Canvas after connecting Type node

Step 4 Double click on the Type node to bring up its properties.
Step 5 The last row represents CLAIMMADE, a boolean value that indicates whether any claims have been made against a particular policy. This is the field we're trying to predict. Click in the Role column for the last row and set the Role to Target (see Figure 6).
Step 6 The first row represents POLICYID, which will not be a good predictor because it is a randomly generated unique number. Click in the Role column for the first row and set the Role to Record ID (see Figure 6).
Step 7 Click in the Role column for the second to last row (CLAIMS) and change it to None (see Figure 6). This value is perfectly correlated with the target, so it would cause a trivial decision tree to be generated if it were not excluded.

Figure 6 Type node properties

Step 8 Click Read Values (see Figure 6).
Step 9 Click OK to close the properties dialog of the Type node.

3. Partitioning the data

In this section you'll partition the historical data into two randomly selected, mutually exclusive groups. One will be used to build a predictive model and the other will be used to test the performance of the model. We partition the data set so that we can independently test the model against data where we already know the answer to the thing we're trying to predict. In this case we're using half the historical data to build the model and the other half to test it. If we were to use all the historical data just to build the model, we would have no way to verify its accuracy right away.

Step 1 Click the Field Ops tab and drag a Partition node onto the canvas to the right of the Type node you just added.
Step 2 Select the Type node you added previously, right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 3 Click the Partition node to complete the connection from the Type node. Your canvas should now look like Figure 7.

Figure 7 Canvas after Partition node added

Step 4 Double click the Partition node to bring up its properties.
Step 5 Note that 50% of the data is being used to train the model and the other 50% is being used to test the model. Let's use the default for this. Close the Partition node properties by clicking OK.

4. Finding the best model

At this point we don't really know which model we should use, i.e. which model is the most accurate. Fortunately, SPSS Modeler has a special node that allows you to run multiple models against the same data and show the overall accuracy of each of them. You will then be able to select the most appropriate model for your data source. (An illustrative sketch of this split-and-compare idea, in open-source Python, follows the Auto Classifier results below.)

Step 1 Select the Modeling tab in the palette.
Step 2 Select the Automated category at the left and then drag an Auto Classifier node onto the canvas to the right of the Partition node you just added.
Step 3 Select the Partition node you added previously, right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 4 Click the Auto Classifier node (it should now be labeled CLAIMMADE) to complete the connection from the Partition node. Your canvas should now look like Figure 8.

Figure 8 Canvas after Auto Classifier node added

Step 5 Double click on the Auto Classifier node you just added to bring up its properties.
Step 6 Select Overall accuracy next to Rank models by: (see Figure 9).

Figure 9 Auto Classifier node properties

Step 7 Click Apply and then click Run. A yellow model nugget will be generated.
Step 8 Arrange the nodes on the canvas by right clicking on any unused area of the canvas and selecting Automatic Layout from the context menu.

Figure 10 Canvas after Automatic Layout

Step 9 Double click on the yellow model nugget. The three most accurate models are displayed, with the most accurate being C5.0. We will therefore use this one for the rest of the workshop.

Figure 11 Results of Auto Classifier
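The Partition and Auto Classifier nodes are point-and-click features of SPSS Modeler, but the underlying idea (split the data, train several candidate algorithms, and rank them by accuracy on the held-out half) can be sketched in a few lines of Python. The sketch below is a conceptual parallel only: it assumes scikit-learn, a claimed.txt readable as a standard CSV with the columns described above, and an arbitrary choice of candidate algorithms; it is not how Modeler implements the Auto Classifier.

```python
# Illustrative only: a 50/50 train/test split and a simple "auto classifier"
# style comparison of algorithms by overall accuracy (scikit-learn).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hypothetical: a comma-separated file shaped like claimed.txt
data = pd.read_csv("claimed.txt")
X = pd.get_dummies(data.drop(columns=["POLICYID", "CLAIMS", "CLAIMMADE"]))
y = data["CLAIMMADE"]

# 50% training / 50% testing, mirroring the Partition node defaults
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

candidates = {
    "Decision tree": DecisionTreeClassifier(random_state=1),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(random_state=1),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.1%} overall accuracy on the testing partition")
```

The highest-scoring candidate plays the role of the algorithm picked by the Auto Classifier node; in the workshop that turns out to be C5.0.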
Step 10 Click OK to close the results of the model nugget.

5. Building the model and checking its accuracy

Now that we know the C5.0 algorithm works best for our data set, we can build the model with it explicitly, knowing that we will get a very accurate model. We'll also get some more detail about the accuracy of the model.

Step 1 Select the Modeling tab in the palette.
Step 2 Select the Classification category at the left and then drag a C5.0 node onto the canvas to the right of the Partition node you added earlier.
Step 3 Select the Partition node, right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 4 Click the C5.0 node (it should now be labeled CLAIMMADE) to complete the connection from the Partition node. Your canvas should now look like Figure 12.

Figure 12 Canvas after adding C5.0 node

Step 5 Double click on the C5.0 node to bring up its properties and click Run.
Step 6 Arrange the nodes on the canvas by right clicking on any unused area of the canvas and selecting Automatic Layout from the context menu. The canvas should now look like Figure 13.

Figure 13 Canvas after running C5.0 and Auto Layout

Step 7 Double click on the C5.0 model nugget to see the results. You should see the top-level decision of the decision tree to the left of the window and a bar chart of the importance of the various predictors.

Figure 14 Predictor importance

Step 8 Click the Preview button. You should see 2 columns added to the results. $C-CLAIMMADE is the prediction of whether each policy will have 1 or more claims and $CC-CLAIMMADE is the confidence of that prediction.

Figure 15 Model predictions

Step 9 Click OK to close the Preview window.
Step 10 Click OK to close the model nugget properties.
Step 11 Select the Output tab in the palette.
Step 12 Drag an Analysis node just to the right of the C5.0 model nugget.
Step 13 Select the C5.0 model nugget, right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 14 Click the Analysis node to complete the connection from the model nugget. Your canvas should now look like Figure 16.

Figure 16 Canvas after Analysis node added

Step 15 Double click on the Analysis node to bring up its properties and then click Run.
Step 16 Your output should look like Figure 17. The model is 91.34% accurate for the training records (the records used to build the model) and 90.77% accurate for the testing set, so we have a predictive model that is roughly 91% accurate.

Figure 17 Output of Analysis node

Step 17 Click OK to close the Analysis node output.

6. Applying the model to new policy applications

We will use the best predictive model, in our case C5.0, against our new policy applications to identify the lowest-risk ones with the highest level of confidence.

Step 1 Select the Sources tab in the palette.
Step 2 Drag another Var. File node onto the bottom left of the canvas. Double click on the Var. File node you just added to bring up its properties.
Step 3 Click on the button with three dots to locate the source file.
Step 4 Navigate to the folder C:\SPSS Workshop\Labs\Lab1 and select the file newquotes.txt
Step 5 Click Open.
Step 6 Click OK to close the Var. File properties.
Step 7 Single click on the new Var. File node you just added (it should now be labeled newquotes.txt).
Step 8 Double click on the C5.0 model nugget in the top right part of the window, where all your model nuggets are saved.

Figure 18 Saved model nuggets

Step 9 Your canvas should now look like Figure 19.

Figure 19 Canvas after adding newquotes.txt file

Step 10 Drag a Table node from the Output palette to the right of the model nugget you just added.
Step 11 Select the newly added model nugget, right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 12 Click the Table node to complete the connection from the model nugget. The bottom part of your canvas should now look like Figure 20.

Figure 20 Canvas after adding Table node

Step 13 Double click on the Table node to bring up its properties and then click Run.
Step 14 You should see 2 columns added to the results. $C-CLAIMMADE is the prediction of whether each policy will have 1 or more claims and $CC-CLAIMMADE is the confidence of that prediction.
Step 15 Click OK to close the output of the Table node.
Step 16 Drag a Select node from the Records Ops palette to directly below the Table node just added.
Step 17 Select the newly added model nugget, right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 18 Click the Select node to complete the connection from the model nugget. The bottom part of your canvas should now look like Figure 21.

Figure 21 Canvas after Select node added

Step 19 Double click on the Select node to bring up its properties.
Step 20 Click on the button shown in Figure 22 to launch the Expression Builder.

Figure 22 Button to launch Expression Builder

Step 21 On the right, double click the field $C-CLAIMMADE to place it in the expression.
Step 22 Next type ='N' followed by a space, then the word and, then another space.
Step 23 Double click the field $CC-CLAIMMADE to place it in the expression.
Step 24 Next type >= 0.90
Step 25 The full expression should now be '$C-CLAIMMADE'='N' and '$CC-CLAIMMADE' >= 0.90
Step 26 Click Check to verify the expression.
Step 27 Click OK to close the Expression Builder.
Step 28 Click OK to close the Select node properties.
Step 29 Drag a Sort node from the Records Ops palette to the right of the Select node just added.
Step 30 Select the Select node, right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 31 Click the Sort node to complete the connection from the Select node. The bottom part of your canvas should now look like Figure 23.

Figure 23 Canvas after adding Sort node

Step 32 Double click on the Sort node to bring up its properties.
Step 33 Click on the button to select fields.

Figure 24 Button to select fields

Step 34 Select the field $CC-CLAIMMADE and click OK.
Step 35 Change the Order to Descending.
Step 36 Click OK to close the Sort node properties.
Step 37 Drag a Flat File node from the Export palette to the right of the Sort node just added.
Step 38 Select the Sort node, right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 39 Click the Flat File node to complete the connection from the Sort node. The bottom part of your canvas should now look like Figure 25.

Figure 25 Canvas after adding Flat File node

Step 40 Double click on the Flat File node to bring up its properties.
Step 41 Click on the button with three dots and navigate to c:\temp.
Step 42 Name the output file lowrisk.txt and click Save.
Step 43 Click Run.
Step 44 Save your model by selecting File->Save.
Step 45 Exit Modeler.

Congratulations! You've successfully built a model to predict risk for new applications for homeowner's insurance.

Lab 2: Retail Market Basket analysis

Lab Objectives: In this lab your mission is to help a marketing team increase the value of the average shopping cart. To do so, they want to better understand product sales patterns and customer profiles and preferences. The lab will guide you through two product affinity scenarios:
1. First you'll analyze transactional and demographic details to find out who is buying the different product associations.
2. Second, still using transactional data, you'll try to figure out what kind of recommendations would be likely to be accepted by the customer base.

Lab Duration: xx min

1. Understand the source data

1. Open SPSS Modeler by clicking Start > All Programs > IBM SPSS Modeler 14.2 > IBM SPSS Modeler 14.2.
2. On the welcome screen, select Create a new stream and click Ok.
3. Drag a Var. File node onto the stream canvas. The source data is read from a comma-delimited source file.
4. Double-click the file node to open the Var. File node properties. In the File tab, click the Ellipsis… and point to the source file: C:\SPSS_Workshop\Labs\Lab1_MBA\Sources\BASKETS1n. Keep the default options and click Apply.
5. Click on Preview to see a snapshot (the first 10 records) of the data. Close the preview window. Click Ok to close the Var. File node properties window.

The data set contains 18 fields, with each record representing a basket (a shopping visit). Notice that within each transaction the quantities purchased are ignored; a flag represents a purchase (or not) for each product category (wine, fruit and vegetables, fish, frozen meals…). In addition to the transaction details, the data contain basket summary information (cardid: loyalty card identifier for the customer purchasing this basket, value: total purchase price of the basket, pmethod: method of payment for the basket) and personal details of the cardholder (sex, homeown: whether or not the cardholder is a homeowner, income and age).

6. Drag and drop a Type node from the Field Ops palette. Connect it with the BASKETS1n node.
7. Double-click on the Type node and click the Read Values button. Click Ok if a window pops up. Notice the measurement level changes for the fields.
8. Change the cardid field to Typeless as this is a key field. Change the sex field to Nominal (this is to ensure that the Apriori modeling algorithm will not treat sex as a flag). Click Ok.
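Before the measurement levels are described in more detail (see the Information box below), here is a rough parallel in Python showing how the same typing decisions might look with pandas. This is purely illustrative and rests on assumptions: it is not how Modeler represents measurement levels, it assumes BASKETS1n can be read with pandas defaults, and only the product-category flags named in the lab text are listed (the data set has eleven such flags in total).

```python
# Illustrative parallel to the Type node: express "measurement levels" as pandas dtypes.
import pandas as pd

baskets = pd.read_csv("BASKETS1n")               # comma-delimited source file (assumed readable as-is)

baskets["cardid"] = baskets["cardid"].astype(str)          # key field: treat as an identifier (Typeless)
baskets["sex"] = baskets["sex"].astype("category")         # Nominal rather than Flag
named_flags = ["fruitveg", "fish", "wine", "confectionery",
               "cannedveg", "beer", "frozenmeal"]           # the categories named in the lab text
for field in named_flags:
    baskets[field] = baskets[field].astype("category")      # Flag: presence/absence of a purchase

print(baskets.dtypes)
```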
Information: Measurement levels (formerly known as "data type" or "usage type") describe the usage of the data fields. Note that this is different from the storage type, which indicates whether data are stored as a string, integer, date… The following measurement levels are available:
- Continuous: used to describe numeric values, such as a range of 0-100.
- Categorical: used for string values when the exact number of distinct values is unknown. This is an uninstantiated data type, meaning all possible information about the storage and usage of the data is not yet known. Once data are read, the measurement level becomes Flag, Nominal or Typeless.
- Flag: used for data with two distinct values that indicate the presence or absence of a trait, such as True or False.
- Nominal: used to describe data with multiple distinct values, each treated as a member of a set, such as small/medium/large.
- Ordinal: used to describe data with multiple distinct values that have an inherent order. For example, salary categories or satisfaction rankings can be typed as ordinal data.
- Typeless: used for data that does not conform to any of the above types, for fields with a single value, or for nominal data where the set has more members than the defined maximum.

9. From the Output palette, drag and drop a Table node and connect it to the Type node.
10. From the toolbar, click the Run button.
11. Save the stream as bask. Organize your streams in a project by dragging and dropping the stream onto the Data Understanding folder of the CRISP-DM project explorer.

2. Prepare the data

From the Graphs palette, drag and drop a Web node onto the stream canvas. Connect the Type node to the newly added Web node.

Double-click on the Web node to open the node properties. Add the eleven flag fields that represent purchases per product category. Check 'Show true flags only', as we want to highlight the different associations between the products. Navigate through the various tabs to see the available options of the Web node.

The Web node shows the strength of relationships between values of two or more symbolic fields. The graph displays connections using varying types of lines to indicate connection strength. You can use a Web node, for example, to explore the relationship between the purchase of various items at an e-commerce site or a traditional retail outlet.

Run the Web node by clicking the run button.

Because most combinations of product categories occur in several baskets, the strong links on this web are too numerous to show the groups of products bought together.

From the Web output window, click the button that shows the web summary and controls. From the Controls tab, select the 'Size shows strong/normal/weak' radio button. Use the sliders to set 'Strong links above' to 100 and 'Weak links below' to 90. This gives you a visual look at which products sell together; based on the data set, there are 3 strong links (see the following screenshot).

Click Ok to close the Web output.

From the Modeling palette, drag and drop an Apriori node onto the stream canvas. Connect the Type node to the Apriori node.

Information: Product affinity analysis is performed using association detection algorithms, which provide rules describing which fields typically occur together.
IBM SPSS Modeler contains three different algorithms that perform association detection: Apriori, Carma and Sequence.
- Apriori has a slightly more efficient approach to association detection but has the limitation that it only accepts categorical fields as inputs. It also contains options that provide a choice of criterion measure used to guide rule generation.
- Carma, in contrast to Apriori, offers build settings for rule support (support for both antecedent and consequent) rather than antecedent support. Carma also allows rules with multiple consequents, or outcomes.
- Sequence is based on the Carma association rules algorithm, which uses an efficient two-pass method for finding sequences.

Notice the Apriori node is named 'No Targets'. This is because we need to specify which fields are to be predicted by the model.

Double-click the Type node to edit its properties. Select the fields from cardid to age and set their role to None. Set the role of the fields from fruitveg to confectionery to Both (meaning that the field can be either an input or an output of the resultant model). Click Ok to close the Type properties.

Double-click the Apriori node, now named 11 Fields. Click the Model tab, make sure Use partitioned data is selected and that the settings are as follows:

Information: The Use partitioned data option splits the data into separate Training, Testing (and Validation) samples. Because we're not using a Partition node, this option will be ignored. By default Apriori will produce rules that have a minimum support of 10% of the sample and a minimum confidence of 80%.

Click Run. An Apriori generated model node will appear in the Models palette and on the stream canvas.

Right-click the Apriori node in the Models palette of the Manager Window, then click Browse. The Apriori algorithm has found only three association rules, displayed by Confidence%. In addition to the consequent and antecedent, Support% and Confidence% are displayed. The first rule tells us that in 17% of the records (shopping visits), beer and frozen meals were purchased. Of this group, 85.882% also bought canned vegetables.

Additional measures can be displayed. Click the Show/hide criteria menu button. Click Instances. Repeat this step to display Lift. We now see that beer and frozen meals were purchased on 170 shopping trips. As there are 1000 records in the sample file, this is (170/1000)*100, or 17.0% of the total shopping trips (Support%). The Lift value is the expected return resulting from using the model or rule. Here it is the ratio of the confidence to the base rate of the consequent. The lift of 2.834 for the first rule means that the chances of buying canned vegetables almost triple when beer and frozen meals are purchased together. (A short worked computation of these measures follows the Information box below.)

Information: The findings we obtained algorithmically using the Apriori model look about the same as those obtained visually using the Web graph, that is, Canned Vegetables, Beer and Frozen Meals are often bought together. Based on the findings, our customers are divided into three categories:
1. Those who buy Fruits and Vegetables along with Fish, who might be called Healthy Eaters.
2. Those who buy Wine and Confectionery.
3. Those who buy Beer, Frozen Meals and Canned Vegetables.
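To make the Support%, Confidence% and Lift figures above concrete, here is a minimal worked computation in Python. Only the total record count, the 170 instances, the 85.882% confidence and the 2.834 lift are stated in the lab; the base rate of canned vegetables is derived from them here, so treat it as an approximation rather than a figure read from the data set.

```python
# Worked example of the association-rule measures for the rule:
#   beer & frozenmeal -> cannedveg
total_baskets = 1000        # records in BASKETS1n (stated in the lab)
antecedent_count = 170      # baskets containing both beer and frozen meals (Instances)
confidence = 0.85882        # reported Confidence%
lift = 2.834                # reported Lift

support = antecedent_count / total_baskets     # 170/1000 = 17.0%  (Support%)
rule_count = confidence * antecedent_count     # ~146 baskets also contain canned vegetables
consequent_rate = confidence / lift            # ~0.303: derived base rate of canned vegetables

print(f"Support    = {support:.1%}")
print(f"Confidence = {confidence:.1%}")
print(f"Lift       = {confidence / consequent_rate:.3f}")
```

In other words, roughly 30% of all baskets contain canned vegetables, but among baskets that already contain beer and frozen meals the figure rises to about 86%, which is where the near-tripling expressed by the lift comes from.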
Now that we have identified three groups of customers based on the types of products they buy, we would also like to know who these customers are, that is, their demographic profile.

3. Know our customers better

We will derive a flag for each group. That is, we want additional fields in our data that indicate whether or not a customer is part of a group.

1. From the Outputs tab in the Manager Window pane, display the Web of 11 Fields which you executed earlier.
2. Using the right mouse button, click the link between fruitveg and fish to highlight it.
3. Click Generate Derive Node For Link. This will add a Derive node onto the stream canvas, named T_T.
4. Edit the resulting Derive node to change the derive field name to Healthy.
5. Repeat those steps with the link from wine to confectionery, naming the resulting Derive node Wine_Chocs.
6. The remaining group is composed of three links. From the toolbar, click View > Explore Mode. Holding down the Shift key, select all three links in the cannedveg, beer and frozenmeal triangle.
7. From the web display menus, select Generate > Derive Node ("And").
8. Rename the resulting Derive node beer_beans_pizza.
9. To profile these customer groups, connect the existing Type node to these three Derive nodes in series.
10. Attach to each Derive node another Type node, as shown in the following screenshot.
11. For each of the new Type nodes, set the role of all fields to None, except for value, pmethod, sex, homeown, income and age, which should be set to Input.
12. Set the role of the relevant customer group field (for example, beer_beans_pizza) to Target.
13. From the Modeling palette, attach a C5.0 node (used for rule induction) to each of the Type nodes to build rule-based profiles of these flags.
14. Set the output type for each C5.0 node to Rule set.
15. Run each model.
16. Each resulting model contains a clear demographic profile for the group. Let's analyze the beer_beans_pizza model by right-clicking its nugget in the Models tab and selecting Browse. From the Model tab, expand Rules for T.
17. The above rule means that customers in the beer_beans_pizza group (beer, frozen meals and canned vegetables) are mostly male with a low income. The predictor importance shows which fields were considered the most important in building the rule-based profile.
18. Looking at the demographic that buys wine and confectionery, we see that their profile is mostly female with a higher income. Value is also among the predictors, meaning that the shopping basket amount is higher for those customers.

In the retail domain, such customer groupings might, for example, be used to target special offers to improve the response rates to direct mailings or marketing actions, or to customize the range of products (the assortment) stocked by a branch to match the demands of its demographic base.

Lab 3: IBM SPSS Text Analytics

Lab Objectives: Text mining in Modeler is accomplished via several nodes that are specialized for handling text.
We begin this lab by briefly discussing each node to familiarize you with them and with the text-mining environment in Modeler. We will then run through a simple text-mining example that will allow you to see how the Text Mining node operates, the extraction of concepts from text, the interactive workbench environment, and the grouping of these concepts into higher-level categories. This example will provide a practical foundation for the remainder of the course. It is based on the IBM SPSS Analytics training course 0A003FR.

Lab Duration: 60 min

1. Text Mining Nodes

Text Analytics is an add-on option that includes five nodes that read or process text. These nodes are contained in their own Text Mining palette.

Importing Text

Text can be stored in a database, Statistics file, or a spreadsheet and imported into Modeler for processing, just as with structured data. Two nodes allow you to import text from other common sources for text mining.
- The File List source node generates a list of document names as input to the text mining process. This node is used when the text resides in external documents rather than in one or more fields in a database or other file. The node outputs a single field with one record for each document or folder listed.
- The Web Feed source node reads in text from Web feeds, such as blogs or news feeds in RSS or HTML formats. The node outputs one or more fields for each record found in the feeds.

Translation

Modeler has the ability to translate text from a variety of languages into English through the use of Language Weaver.
- The Translate node is used to translate text, from either fields or documents, from several supported languages, such as Arabic, Chinese, and Russian, into English for modeling. This makes it possible to mine text in a language even if analysts are unable to read that language. The same functionality can be invoked from the text modeling nodes, but use of a separate Translate node makes it possible to cache and reuse a translation in multiple nodes.

Text Mining and Analysis

Modeler includes two nodes to mine text to extract concepts and to find links between them.
- The Text Mining node uses linguistic methods to extract key concepts from the text, allows you to create categories with these concepts and other data, and offers the ability to identify relationships and associations between concepts based on known patterns (called text link analysis). The node can be used to explore the text data, or to produce either a concept model or a category model, and then use that generated model information in subsequent modeling.
- The Text Link Analysis node extracts concepts and also identifies relationships between concepts based on known patterns within the text. Pattern extraction can be used to discover relationships between your concepts, as well as any opinions or qualifiers attached to these concepts.

Viewing Text within Files

The File Viewer node allows you to view text contained in documents from within Modeler.
- The File Viewer node provides you with direct access to your original text data in documents, however you use that data in a stream. The node can help you better understand the results from text extraction by providing access to the source text from which concepts were extracted, since it is otherwise inaccessible in the stream.
2. A Typical Text Mining Session

Conducting a text mining analysis comprises these essential steps:

Import: Text data are read into Modeler, either in a field, from documents, or from a web feed.

Extraction: An extractor engine automatically locates and collects the key terms from the text data. It also collects these terms into higher-level groups. Types are collections of similar terms, such as organizations, products, or positive statements. Patterns are combinations of terms and types that represent qualifiers, such as positive comments about an organization.

Categorization: Using linguistic methods, co-occurrence rules, or a standard term frequency approach, categories are created from the extracted results. The categories represent higher-level concepts that capture the chief ideas and key information in the text data.

Model Generation: Once categories are created, a text mining model node can be generated that allows text data to be scored, so that text data can be turned into structured information, typically flag fields indicating whether a concept is expressed in the text of a record, document, etc.

Editing Resources: Although the four steps above are sufficient to conduct text mining, they are usually not adequate on their own. Invariably, you will edit the dictionary resources supplied with Modeler, adding information and making modifications to ensure that the appropriate terms are extracted and categories created. This editing is done using the interactive workbench. Editing is an iterative process, and it can occur at any point after data have been imported.

We will follow all these steps in the following lab.

3. Import Data

We will launch Modeler and begin a new stream. For this example, we use a data file. We don't provide much background information here about this data except to note that the responses come from calls with problems or complaints to a call center for a telecommunications firm. The data were recorded over two months of calls and are a sample of all the calls received in that period. The data are contained in the text file:

C:\SPSS Workshop\Labs\Lab2 - Text Analytics\Source\Data\Astroserve0304.txt

- Add a Var. File node from the Sources palette to the stream canvas.
- Edit the Var. File node.
- Select the file Astroserve0304.txt from the C:\SPSS Workshop\Labs\Lab2 - Text Analytics\Source\Data\ directory.
- Click the Tab check box in the Delimiters area, and deselect the Comma check box.
- In the Quotes area, select Include as text for Single quotes and select Discard for Double quotes.

How quotes are handled can be critical when text is read. Also, since text data often contain commas, a comma is normally not used as the delimiter character.

To preview the data, click on the Preview button.

There is an ID field for each customer call (Query_ID), day and month fields, and the text field itself (query). The text can be quite lengthy, and the entries are of varied length. Note that there are spelling errors in the text (lightening and recieved), abbreviations (cust and i/net), different date formats, and special terms specific to this organization or industry (ntu and DTP).
Sometimes the linguistic resources will automatically handle these situations, but often some editing of the dictionary resources will be necessary.

Next we'll add a text mining modeling node to the stream.
- Close the Table window.
- Add a Text Mining node from the Text Mining palette to the stream.
- Connect the Text Mining node to the Var. File node.
- Edit the Text Mining node.

Information: You may have noticed a delay when adding the Text Mining node to the stream. The first time you add a Text Mining node to the stream in a Modeler session, the software has to install the resources that will be available to the node when it is executed. This makes the node "heavy" and requires more loading time.

Data can be stored in a separate text field, as in the current data file, or in separate documents. When in documents, the path name, or location, of the documents is specified along with other settings appropriate to the text and its format.

- Click the Field Chooser button and select query as the Text field.
- Click the Model tab.

Text mining can be executed in two different modes from this node. The default is Build interactively (category model nugget), which will open a special interactive environment in which you can perform exploratory analysis (such as clustering and text link analysis), edit the linguistic resources, create categories, either automatically or manually, or otherwise fine-tune the text mining results. Alternatively, the Generate directly (concept model nugget) selection will build a model automatically, using the node settings and the linguistic resources that are selected (in the Resource template area). We will use the interactive workbench in this example.

- Click on the Exploring Text link analysis (TLA) results option.

4. Selecting a Set of Categories and Resources

In the final step in the New Project Wizard, you can select a text analysis package (TAP) that contains the linguistic resources and categories that best fit your text data. A text analysis package (TAP) includes the linguistic resources -the libraries, dictionaries, and related resources- which are used to extract key concepts from the text. Also included are one or more category sets, each of which contains an enhanced code frame.

In Text Analytics for Modeler there are several pre-built TAP files for English language text. Each TAP file shipped with this product is fine-tuned for a specific type of survey, such as employee, product, and customer satisfaction surveys. You can also save your own custom TAP created from a combination of shipped resources and the changes and additions that you make. You can then choose this custom TAP for a new project.

Even if you don't have survey data that exactly fits these types, the TAPs may be helpful. You just need to broaden the view of "product" or "customer" when thinking about your data. For example, course evaluation surveys are essentially product evaluation surveys where the product is the course.

- Click on the Text analysis package option under Copy Resources From.
- Click on the Load button.
- Click on Customer_Satisfaction.tap.
- Click on the Mixed Opinions option under Category Sets and click on the Load button.

We will make one change to the Expert settings.
- Click the Expert tab.
- Click the Accommodate spelling for a minimum root character limit of check box.

The Expert tab contains options to control how text is extracted, and many of the choices deal with problems in text, such as spelling and punctuation errors. The option to try to fix spelling errors is turned off by default, but we'd like to use it for this text because text entry was rather haphazard and done on the fly by customer service representatives.

We have now made all the necessary selections to execute the Text Mining node:
- Click Execute.

When the node is executed, two things happen. The interactive workbench environment opens, and text extraction commences. There is a text extraction progress window that lists the steps and progress in the text extraction process.

Information: The first time a Text Mining node executes in a Modeler session, it will take longer than subsequent executions, even with the same settings. When it first executes, the template to be used for text mining has to be loaded.

When extraction is complete, the extracted results are displayed in the interactive workbench window.

- Click on the View button in the upper right hand corner of the interactive workbench and select Categories and Concepts to see the extracted concepts and categories.

The interactive workbench window has four different views for different types of analysis or for editing the dictionary resources. The default view is Categories and Concepts, and in this view there are four panes:

Extracted Results pane: Located in the lower left corner, this pane is the area in which you perform an extraction and where the results can be edited and refined. The extracted concepts, or terms, are listed, along with their type and frequency in the text.

Data pane: Located in the lower right corner, this pane is initially blank and is used to present the text data corresponding to selections in the other panes. The text is not displayed automatically but will be displayed when the Display button in the Extracted Results or Categories pane is clicked.

Categories pane: Located in the upper left corner, this area presents a table of the categories that have been created along with their frequency in the text. The categories can be managed and edited from this pane.

Visualization pane: Located in the upper right corner, this pane provides various graphical representations of categorization (and will provide other types of visualization for other views). The graphs include a bar chart of categories, a web graph showing category relationships, and a table displaying the same information in a more traditional format. The visual display corresponds to what is selected in the other panes.

5. Extracted Results Pane

The default display in this pane shows the extracted concepts. These are single words, such as complaint, or compound words, such as phone service. All words are displayed in lower case. The concepts are ordered by their document frequency (Docs column), which is the number of documents/records in which they appear. Thus, the concept service appears in 1,221 records, which is 23% of the total number of customer calls.
A second column listing frequency is headed "Global" and represents the number of times a concept appears in the entire set of documents or records. If a concept appears three times in one record, it is counted three times for the Global frequency. Thus, the concept service appears 1,745 times, which is 2% of the total number of occurrences of all concepts.

When concepts are extracted, they are assigned a Type to help group similar concepts, and they are color coded according to their type. We see that all the concepts visible have a type of Unknown and are colored blue. There are several built-in types delivered with Text Mining for Modeler, such as Location, Product, Person, Positive (qualifiers) and Negative (qualifiers).

The most frequent concepts are cust and customer, which is expected for call center data from customers. The first visible concepts that may be interesting for text mining are probably service, complaint, and fault.

There are two views in the Extracted Results pane, Concepts and Type. Let's look at the types next.

- Click the View selection button and select Type.

The display is similar to that for concepts, with frequencies by both documents and globally. By far the most frequent type is Unknown, and you can expect this to be true the first time you extract text data using the default linguistic resources. Even so, Modeler has found 2,351 occurrences of dates, 1,814 occurrences of persons, and 5,653 occurrences of budget-related items in the text, among other types.

6. Data Pane

If we wonder about the specific terms grouped under a type, we can view the original text data in the Data pane. Let's do that for the Organization type.

- Click the <Organization> type to select it.
- Click the Display button.

The Data pane displays one row per document or record corresponding to your selection in another pane (in this instance, the Extracted Results pane). If the text data is relatively short in length, the text field displays most or all of the text data. But the call center records can be quite lengthy, and then the text field column shows a short portion of the text and there is a Text Preview pane to the right that displays more of the text. This may not be visible immediately. To see it:

- Maximize the interactive workbench window.
- Click on the second text record to select it.
- Move the pane divider between the two columns so both are visible.

The word or words that were extracted and placed in the Organization type are highlighted in yellow. For the second text entry, the firm name nokia was extracted from this call center entry.

In the text data, all words in color have been extracted, so you can see that a large portion of the text in this record was extracted. All words in black were not extracted. The words that are not extracted are often connectors, verbs or pronouns. These words are used during the extraction to make sense of the text but are not terms in themselves. Some words that could be verbs, such as repair, have been extracted, but in the context of the text, repair is a noun, not a verb. This is a result of the natural language processing used by Modeler.

- Click the dropdown list in the Extracted Results pane and select Concept.
The Extracted Results can be sorted by the contents of any column, in alphabetic order or by frequency. To sort, repeatedly click in a column header to select the sort order you prefer. We want to order the extracted concepts in reverse alphabetical order.

- Click once in the Concept column header until you see this display.
- Scroll down until you see the word mobile in the entries.

There are many separate concepts that begin with the word mobile. In the extraction process, although these terms are related, as they all refer to mobile (cell phone) use or service, they are kept separate. This is to allow maximum flexibility, but also because text extraction is, in reality, a two-step process. First the program must find the meaningful information in the text. Second, the program can then group together related concepts (the process of categorization). We'll see how Modeler handled these concepts when we examine the categorization results. Before doing so, we briefly review some of the resources that are used in extracting and typing terms.

7. Resource Editor

The Resource Editor in the interactive workbench allows you to edit and refine the linguistic resources to tune the dictionaries for the specific type of text you are mining.

- Click the View button in the upper right hand corner of the interactive workbench and select Resource Editor.
- If necessary, click Customer Satisfaction Opinions in the library tree in the upper left corner.

The Resource Editor window displays four panes. The top left pane shows the libraries contained within the Customer Satisfaction Opinions Text Analysis Package (TAP). There is always a Local library in an interactive workbench session by default, although it is empty when you begin.

- Click on the Core Library (English).

The Type pane is located in the upper center section of the window and displays the types and associated terms. We see that great britain and los angeles are associated with the Location type and are in the Core library. You can scroll through the type window to see the various terms and types (notice that all text is in lower case). It probably seems odd that only a few locations are listed, but this is because the vast majority of the supplied type information is in compiled resources that are not visible to the user. Thus Modeler is able to successfully recognize thousands of geographical locations from the compiled Location type resources. You only need to add other locations if they are not recognized in extraction.

The lower pane shows synonyms that are applied to handle words with the same meaning, plural forms, and spelling variants. The Target term will be displayed and used in the extracted results, and any of the Synonyms found will be replaced by the target. Invariably, you will make changes to the synonyms to tune the results for your specific text.

The fourth pane, on the right hand side, is the Excluded pane and lists words that are not to be extracted, usually because they are not meaningful or they add clutter to the results. In the call center data, the words cust and customer fall into this category, and we may want to consider excluding them from extraction.
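To give a feel for what the Target/Synonym and Excluded resources accomplish, here is a minimal conceptual sketch in Python. It is not how Modeler implements its linguistic resources; the synonym map and most of the excluded terms below are invented for illustration (only cust and customer come from the lab text).

```python
# Conceptual sketch of target-term substitution and exclusion (not Modeler's internals).
synonyms = {
    "recieved": "received",        # a spelling variant mapped to its target term (hypothetical entry)
    "cell phone": "mobile phone",  # a synonym mapped to its target term (hypothetical entry)
}
excluded = {"cust", "customer"}    # terms we might choose not to extract at all

def normalize(concepts):
    """Replace synonyms with their target term, then drop excluded terms."""
    cleaned = []
    for concept in concepts:
        concept = synonyms.get(concept, concept)
        if concept not in excluded:
            cleaned.append(concept)
    return cleaned

print(normalize(["cust", "cell phone", "recieved", "fault"]))
# -> ['mobile phone', 'received', 'fault']
```

Editing the real resources in the Resource Editor plays the same role: it decides which surface forms collapse into one extracted concept and which terms never appear in the results at all.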
8. Categorization

Once you have extracted terms, the next step is to categorize the responses using various automatic methods. Categories are higher-level concepts that represent higher-level ideas and information in the text. They can also represent all those terms that use certain words, such as the terms using mobile. Categories can also represent a single concept, depending on the choice of method and settings. Some concepts are important enough, or unusual enough, that they represent a distinct category. Category names are based on the key concept(s) extracted from the set of text.

Because we use the Customer Satisfaction Opinions TAP, categories were already included.

The Docs column now has an icon with two arrows in each cell for each category rather than a document count. This is because the data must be scored to determine which records contain which categories.

- Click the Score button.

Another progress window appears (not shown here). After scoring the results, we can see the number of records for each category listed. The category labeled Uncategorized lists how many responses are uncategorized (here 1150). The category names have the type of the key concept added as a prefix. Hence, the category "Pos: Product: Design/Feature" is a category created from concepts that capture comments from customers who indicate that they are generally satisfied with the design and features of the various products they have.

Let's look at one of the categories in more detail.

- Double-click on the category Neg: Product: Functioning.

The symbol by each component of this category indicates that each of them is a rule. Rules themselves consist of two or more concepts, types, or patterns. For example, there is a rule linking clarification of issues for the customer and non-positive comments:

[<*clarification* & !(<Positive>)]

The exclamation point is the not symbol, so this rule says to place responses in this category if they mention clarification of an issue and a non-positive comment. If you click on this rule and then click the Display button, you can see some of the responses associated with it in the Data pane. The relevant text used for categorization will be highlighted.

Another category that was automatically created was Neg: Pricing and Billing. This category could potentially be ideal for identifying customers who have concerns about the company's billing practices. In addition, we can use the Visualization pane to get an indication of what other categories commonly occur with this category. A Category Bar chart will provide exactly this information.

- Click the Neg: Pricing and Billing category in the categories list.
- Click the Category Bar tab in the Visualization pane (if necessary).
- Click the Display button in the Categories pane.

After a few moments, a bar chart appears in the Visualization pane. You may need to adjust column sizes so that the full Docs column count is visible.

This chart lists the overlap between categories for the 528 records that contain terms in the Neg: Pricing and Billing category. We see that the Neg: Product: Functioning category occurs in 181 of those records, or 34.3%.
This type of information can be very helpful in either a) understanding a category or b) thinking about other categories that might be created (by grouping two or more categories that occur together frequently).

9. Generating a Text Mining Model

The last step in the text mining modeling process is the creation of a generated text mining model that can be used to score (categorize) data. After scoring data, the text information can be combined with structured data to create a variety of models. We will generate a model, browse it, and then add it to the stream to review its output.

Models are created from the Generate menu in the interactive workbench.

- Click Generate…Generate Model.
- Click File…Close to close the interactive workbench.

Modeler provides you with the option to save the interactive session so you can return to it in its current state, exit the session without saving, or close the window but keep the session available. For this example, we'll simply exit.

- Click the Exit button.

You will be returned to the Modeler environment.

- Right-click on the generated model named query in the Models manager area.
- Click Browse.

The model scores the data for the categories, so browsing a model means seeing a list of the categories created along with their descriptors. To see the detailed information, click on a category.

- Click on the Pos: General Satisfaction category.

We see each concept (descriptor) in this category, the concept's type, and another column listing details about other terms that were added under a specific concept.

To see the effect of a text mining model, we'll add the model to the stream, connect it to the data source, and run the data into a Table node.

- Close the Model Browsing window.
- Click on the generated model query and drag it into the stream near the data source.
- Connect the data source node to the generated model.
- Edit the generated model.
- To see the results, click on the Preview button.
- Scroll to the right in the Preview output, past the query field.

Information: This could be done with a Table node and executing that path within the stream, but for large datasets the preview is a much quicker option. Of course, to see all of the results, a Table node would be needed.

New flag fields have been created, with values of T or F, for each category. A record is coded T (true) if that category is contained in the text for that record. A record is coded F (false) if the category is not in the text. These new fields can now be used to generate reports, further investigate the relationships between the categories, study the relationship between other fields, such as demographic information, and the categories, or develop models that use the category fields as inputs. You could even use a category field as an outcome and attempt to predict what factors lead to particular comments in customer calls.

- Close the Preview window.

We've now seen a complete, albeit simple, example of text mining modeling. Although many details and complications went unmentioned, and we did no editing of the dictionaries, this example has presented the key steps in text mining.

Congratulations! You've just finished the workshop.