SPSS WORKSHOP

Version: 1.0
Last Modification Date: 10/October/2012
Owner: ISV & Developer Relations, La Gaude
Emmanuel Genard ([email protected])
Benoit Barranco ([email protected])
Armand Ruiz ([email protected])

Table of Contents

Lab 1: Create an SPSS Model to Predict Insurance Risk
1. Load the historical data from a flat file
2. Define what we are trying to predict
3. Partitioning our data
4. Finding the best model
5. Building the model and checking its accuracy
6. Applying the model to new policy applications

Lab 2: Retail Market Basket analysis
1. Understand the source data
2. Prepare the data
3. Know our customers better

Lab 3: Text Analytics
1. Text Mining Nodes
2. A Typical Text Mining Session
3. Import Data
4. Selecting a Set of Categories and Resources
5. Extracted Results Pane
6. Data Pane
7. Resource Editor
8. Categorization
9. Generating a Text Mining Model

Lab 1: Create an SPSS Model to Predict Insurance Risk

Lab Objectives: In this lab we will use SPSS Modeler and historical data to build a model that predicts which new homeowner's insurance applications will be least risky. We will then apply the model to new applications so that the least risky ones can be offered a policy discount.
You will complete the following tasks:
i) Loading data into SPSS Modeler from a flat file.
ii) Defining the target of the model and excluding fields that take away from the model's accuracy.
iii) Partitioning data into training and testing sets.
iv) Automatically finding the best algorithm for the model.
v) Comparing the results of different algorithms running against the same data.
vi) Checking the accuracy of a model.
vii) Exporting the results of a model.

Lab Duration: 40 minutes

1. Load the historical data from a flat file

In this section you'll load historical data on homeowner policies and their history of claims from a comma-separated flat file. The data include both policies where the policyholder has made claims and policies without claims. The goal of the model is to predict which new policy applications will be least risky (i.e. result in no claims) so we can offer those applicants a policy discount.

Step 1 Start SPSS Modeler from the Start menu (Start=>All Programs=>IBM SPSS Modeler 15.0=>IBM SPSS Modeler 15.0).
Step 2 Click on Create a new stream (top right hand side of the dialog).
Step 3 Click the Sources tab and drag a Var. File node onto the canvas.

Figure 1 Canvas after Var. File node added

Step 4 Double click on the Var. File node you just added to bring up its properties.
Step 5 Click on the button with three dots to locate the source file.

Figure 2 Locate source file

Step 6 Navigate to the folder C:\SPSS Workshop\Labs\Lab1 and select the file claimed.txt
Step 7 Click Open.
Step 8 Click Preview to see a selection of the data in the file.

Figure 3 Preview of data in source file

Step 9 Click OK to close the Preview window.
Step 10 Click OK to close the Var. File properties.

2. Define what you're trying to predict

In this section you'll indicate which field you're trying to predict and which fields will be considered in developing a predictive model. The roles can actually be defined in the Var. File node itself; using a separate Type node is simply recommended because the same Var. File node can then be used in multiple branches, or copied and pasted on the same canvas, with each instance typed differently through its own Type node.

Step 1 Click the Field Ops tab and drag a Type node onto the canvas to the right of the Var. File node you just added.

Figure 4 Dragging a Type node onto the canvas

Step 2 Select the Var. File node you added previously (it should now be labeled claimed.txt), right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 3 Click the Type node to complete the connection from the Var. File node. Your canvas should now look like Figure 5.

Figure 5 Canvas after connecting Type node

Step 4 Double click on the Type node to bring up its properties.
Step 5 The last row represents CLAIMMADE, a boolean value that indicates whether any claims have been made against a particular policy. This is the field we're trying to predict. Click in the Role column for the last row and set the Role to Target (see Figure 6).
Step 6 The first row represents POLICYID, which will not be a good predictor because it is a randomly generated unique number. Click in the Role column for the first row and set the Role to Record ID (see Figure 6).
Step 7 Click in the Role column for the second to last row (CLAIMS) and change it to None (see Figure 6). This value is perfectly correlated with the target, so it would cause a trivial decision tree to be generated if it were not excluded.

Figure 6 Type node properties

Step 8 Click Read Values (see Figure 6).
Step 9 Click OK to close the properties dialog of the Type node.

3. Partitioning the data

In this section you'll partition the historical data into two randomly selected, mutually exclusive groups. One will be used to build a predictive model and the other will be used to test the performance of the model. We partition the data set so that we can independently test the model against data where we already know the answer to the thing we're trying to predict. In this case we're using half the historical data to build the model and the other half to test it. If we were to use all the historical data just to build the model, we would have no way to verify its accuracy right away.

Step 1 Click the Field Ops tab and drag a Partition node onto the canvas to the right of the Type node you just added.
Step 2 Select the Type node you added previously, right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 3 Click the Partition node to complete the connection from the Type node. Your canvas should now look like Figure 7.

Figure 7 Canvas after Partition node added

Step 4 Double click the Partition node to bring up its properties.
Step 5 Note that 50% of the data is being used to train the model and the other 50% is being used to test the model. Let's use the default for this. Close the Partition node properties by clicking OK.

4. Finding the best model

At this point we don't really know which model we should use, i.e. which model is the most accurate. Fortunately, SPSS Modeler has a special node that allows you to run multiple models against the same data and show the overall accuracy of each of them. You will then be able to select the most appropriate model for your data source. (An illustrative sketch of this split-and-compare idea, in open-source Python, follows the Auto Classifier results below.)

Step 1 Select the Modeling tab in the palette.
Step 2 Select the Automated category at the left and then drag an Auto Classifier node onto the canvas to the right of the Partition node you just added.
Step 3 Select the Partition node you added previously, right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 4 Click the Auto Classifier node (it should now be labeled CLAIMMADE) to complete the connection from the Partition node. Your canvas should now look like Figure 8.

Figure 8 Canvas after Auto Classifier node added

Step 5 Double click on the Auto Classifier node you just added to bring up its properties.
Step 6 Select Overall accuracy next to Rank models by: (see Figure 9).

Figure 9 Auto Classifier node properties

Step 7 Click Apply and then click Run. A yellow model nugget will be generated.
Step 8 Arrange the nodes on the canvas by right clicking on any unused area of the canvas and selecting Automatic Layout from the context menu.

Figure 10 Canvas after Automatic Layout

Step 9 Double click on the yellow model nugget. The three most accurate models are displayed, with the most accurate being C5.0. We will therefore use this one for the rest of the workshop.

Figure 11 Results of Auto Classifier
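The Partition and Auto Classifier nodes are point-and-click features of SPSS Modeler, but the underlying idea (split the data, train several candidate algorithms, and rank them by accuracy on the held-out half) can be sketched in a few lines of Python. The sketch below is a conceptual parallel only: it assumes scikit-learn, a claimed.txt readable as a standard CSV with the columns described above, and an arbitrary choice of candidate algorithms; it is not how Modeler implements the Auto Classifier.

```python
# Illustrative only: a 50/50 train/test split and a simple "auto classifier"
# style comparison of algorithms by overall accuracy (scikit-learn).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hypothetical: a comma-separated file shaped like claimed.txt
data = pd.read_csv("claimed.txt")
X = pd.get_dummies(data.drop(columns=["POLICYID", "CLAIMS", "CLAIMMADE"]))
y = data["CLAIMMADE"]

# 50% training / 50% testing, mirroring the Partition node defaults
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

candidates = {
    "Decision tree": DecisionTreeClassifier(random_state=1),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(random_state=1),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.1%} overall accuracy on the testing partition")
```

The highest-scoring candidate plays the role of the algorithm picked by the Auto Classifier node; in the workshop that turns out to be C5.0.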
Step 10 Click OK to close the results of the model nugget.

5. Building the model and checking its accuracy

Now that we know the C5.0 algorithm works best for our data set, we can build the model with it explicitly, knowing that we will get a very accurate model. We'll also get some more detail about the accuracy of the model.

Step 1 Select the Modeling tab in the palette.
Step 2 Select the Classification category at the left and then drag a C5.0 node onto the canvas to the right of the Partition node you added earlier.
Step 3 Select the Partition node, right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 4 Click the C5.0 node (it should now be labeled CLAIMMADE) to complete the connection from the Partition node. Your canvas should now look like Figure 12.

Figure 12 Canvas after adding C5.0 node

Step 5 Double click on the C5.0 node to bring up its properties and click Run.
Step 6 Arrange the nodes on the canvas by right clicking on any unused area of the canvas and selecting Automatic Layout from the context menu. The canvas should now look like Figure 13.

Figure 13 Canvas after running C5.0 and Auto Layout

Step 7 Double click on the C5.0 model nugget to see the results. You should see the top-level decision of the decision tree to the left of the window and a bar chart of the importance of the various predictors.

Figure 14 Predictor importance

Step 8 Click the Preview button. You should see 2 columns added to the results. $C-CLAIMMADE is the prediction of whether each policy will have 1 or more claims and $CC-CLAIMMADE is the confidence of that prediction.

Figure 15 Model predictions

Step 9 Click OK to close the Preview window.
Step 10 Click OK to close the model nugget properties.
Step 11 Select the Output tab in the palette.
Step 12 Drag an Analysis node just to the right of the C5.0 model nugget.
Step 13 Select the C5.0 model nugget, right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 14 Click the Analysis node to complete the connection from the model nugget. Your canvas should now look like Figure 16.

Figure 16 Canvas after Analysis node added

Step 15 Double click on the Analysis node to bring up its properties and then click Run.
Step 16 Your output should look like Figure 17. The model is 91.34% accurate for the training records (the records used to build the model) and 90.77% accurate for the testing set, so we have a predictive model that is roughly 91% accurate.

Figure 17 Output of Analysis node

Step 17 Click OK to close the Analysis node output.

6. Applying the model to new policy applications

We will use the best predictive model, in our case C5.0, against our new policy applications to identify the lowest-risk ones with the highest level of confidence.

Step 1 Select the Sources tab in the palette.
Step 2 Drag another Var. File node onto the bottom left of the canvas. Double click on the Var. File node you just added to bring up its properties.
Step 3 Click on the button with three dots to locate the source file.
Step 4 Navigate to the folder C:\SPSS Workshop\Labs\Lab1 and select the file newquotes.txt
Step 5 Click Open.
Step 6 Click OK to close the Var. File properties.
Step 7 Single click on the new Var. File node you just added (it should now be labeled newquotes.txt).
Step 8 Double click on the C5.0 model nugget in the top right part of the window, where all your model nuggets are saved.

Figure 18 Saved model nuggets

Step 9 Your canvas should now look like Figure 19.

Figure 19 Canvas after adding newquotes.txt file

Step 10 Drag a Table node from the Output palette to the right of the model nugget you just added.
Step 11 Select the newly added model nugget, right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 12 Click the Table node to complete the connection from the model nugget. The bottom part of your canvas should now look like Figure 20.

Figure 20 Canvas after adding Table node

Step 13 Double click on the Table node to bring up its properties and then click Run.
Step 14 You should see 2 columns added to the results. $C-CLAIMMADE is the prediction of whether each policy will have 1 or more claims and $CC-CLAIMMADE is the confidence of that prediction.
Step 15 Click OK to close the output of the Table node.
Step 16 Drag a Select node from the Records Ops palette to directly below the Table node just added.
Step 17 Select the newly added model nugget, right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 18 Click the Select node to complete the connection from the model nugget. The bottom part of your canvas should now look like Figure 21.

Figure 21 Canvas after Select node added

Step 19 Double click on the Select node to bring up its properties.
Step 20 Click on the button shown in Figure 22 to launch the Expression Builder.

Figure 22 Button to launch Expression Builder

Step 21 On the right, double click the field $C-CLAIMMADE to place it in the expression.
Step 22 Next type ='N' followed by a space, then the word and, then another space.
Step 23 Double click the field $CC-CLAIMMADE to place it in the expression.
Step 24 Next type >= 0.90
Step 25 The full expression should now be '$C-CLAIMMADE'='N' and '$CC-CLAIMMADE' >= 0.90
Step 26 Click Check to verify the expression.
Step 27 Click OK to close the Expression Builder.
Step 28 Click OK to close the Select node properties.
Step 29 Drag a Sort node from the Records Ops palette to the right of the Select node just added.
Step 30 Select the Select node, right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 31 Click the Sort node to complete the connection from the Select node. The bottom part of your canvas should now look like Figure 23.

Figure 23 Canvas after adding Sort node

Step 32 Double click on the Sort node to bring up its properties.
Step 33 Click on the button to select fields.

Figure 24 Button to select fields

Step 34 Select the field $CC-CLAIMMADE and click OK.
Step 35 Change the Order to Descending.
Step 36 Click OK to close the Sort node properties.
Step 37 Drag a Flat File node from the Export palette to the right of the Sort node just added.
Step 38 Select the Sort node, right click and select Connect… from the context menu (alternatively you can press the F2 key).
Step 39 Click the Flat File node to complete the connection from the Sort node. The bottom part of your canvas should now look like Figure 25.

Figure 25 Canvas after adding Flat File node

Step 40 Double click on the Flat File node to bring up its properties.
Step 41 Click on the button with three dots and navigate to c:\temp.
Step 42 Name the output file lowrisk.txt and click Save.
Step 43 Click Run.
Step 44 Save your model by selecting File->Save.
Step 45 Exit Modeler.

Congratulations! You've successfully built a model to predict risk for new applications for homeowner's insurance.

Lab 2: Retail Market Basket analysis

Lab Objectives: In this lab your mission is to help a marketing team increase the value of the average shopping cart. To do so, they want to better understand product sales patterns and customer profiles and preferences. The lab will guide you through two product affinity scenarios:
1. First you'll analyze transactional and demographic details to find out who is buying the different product associations.
2. Second, still using transactional data, you'll try to figure out what kind of recommendations would be likely to be accepted by the customer base.

Lab Duration: xx min

1. Understand the source data

1. Open SPSS Modeler by clicking Start > All Programs > IBM SPSS Modeler 14.2 > IBM SPSS Modeler 14.2.
2. On the welcome screen, select Create a new stream and click Ok.
3. Drag a Var. File node onto the stream canvas. The source data is read from a comma-delimited source file.
4. Double-click the file node to open the Var. File node properties. In the File tab, click the Ellipsis… and point to the source file: C:\SPSS_Workshop\Labs\Lab1_MBA\Sources\BASKETS1n. Keep the default options and click Apply.
5. Click on Preview to see a snapshot (the first 10 records) of the data. Close the preview window. Click Ok to close the Var. File node properties window.

The data set contains 18 fields, with each record representing a basket (a shopping visit). Notice that within each transaction the quantities purchased are ignored; a flag represents a purchase (or not) for each product category (wine, fruit and vegetables, fish, frozen meals…). In addition to the transaction details, the data contain basket summary information (cardid: loyalty card identifier for the customer purchasing this basket, value: total purchase price of the basket, pmethod: method of payment for the basket) and personal details of the cardholder (sex, homeown: whether or not the cardholder is a homeowner, income and age).

6. Drag and drop a Type node from the Field Ops palette. Connect it with the BASKETS1n node.
7. Double-click on the Type node and click the Read Values button. Click Ok if a window pops up. Notice the measurement level changes for the fields.
8. Change the cardid field to Typeless as this is a key field. Change the sex field to Nominal (this is to ensure that the Apriori modeling algorithm will not treat sex as a flag). Click Ok.
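Before the measurement levels are described in more detail (see the Information box below), here is a rough parallel in Python showing how the same typing decisions might look with pandas. This is purely illustrative and rests on assumptions: it is not how Modeler represents measurement levels, it assumes BASKETS1n can be read with pandas defaults, and only the product-category flags named in the lab text are listed (the data set has eleven such flags in total).

```python
# Illustrative parallel to the Type node: express "measurement levels" as pandas dtypes.
import pandas as pd

baskets = pd.read_csv("BASKETS1n")               # comma-delimited source file (assumed readable as-is)

baskets["cardid"] = baskets["cardid"].astype(str)          # key field: treat as an identifier (Typeless)
baskets["sex"] = baskets["sex"].astype("category")         # Nominal rather than Flag
named_flags = ["fruitveg", "fish", "wine", "confectionery",
               "cannedveg", "beer", "frozenmeal"]           # the categories named in the lab text
for field in named_flags:
    baskets[field] = baskets[field].astype("category")      # Flag: presence/absence of a purchase

print(baskets.dtypes)
```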
Information: Measurement levels (formerly known as "data type" or "usage type") describe the usage of the data fields. Note that this is different from the storage type, which indicates whether data are stored as a string, integer, date… The following measurement levels are available:
- Continuous: used to describe numeric values, such as a range of 0-100.
- Categorical: used for string values when the exact number of distinct values is unknown. This is an uninstantiated data type, meaning all possible information about the storage and usage of the data is not yet known. Once data are read, the measurement level becomes Flag, Nominal or Typeless.
- Flag: used for data with two distinct values that indicate the presence or absence of a trait, such as True or False.
- Nominal: used to describe data with multiple distinct values, each treated as a member of a set, such as small/medium/large.
- Ordinal: used to describe data with multiple distinct values that have an inherent order. For example, salary categories or satisfaction rankings can be typed as ordinal data.
- Typeless: used for data that does not conform to any of the above types, for fields with a single value, or for nominal data where the set has more members than the defined maximum.

9. From the Output palette, drag and drop a Table node and connect it to the Type node.
10. From the toolbar, click the Run button.
11. Save the stream as bask. Organize your streams in a project by dragging and dropping the stream onto the Data Understanding folder of the CRISP-DM project explorer.

2. Prepare the data

From the Graphs palette, drag and drop a Web node onto the stream canvas. Connect the Type node to the newly added Web node.

Double-click on the Web node to open the node properties. Add the eleven flag fields that represent purchases per product category. Check 'Show true flags only', as we want to highlight the different associations between the products. Navigate through the various tabs to see the available options of the Web node.

The Web node shows the strength of relationships between values of two or more symbolic fields. The graph displays connections using varying types of lines to indicate connection strength. You can use a Web node, for example, to explore the relationship between the purchase of various items at an e-commerce site or a traditional retail outlet.

Run the Web node by clicking the run button.

Because most combinations of product categories occur in several baskets, the strong links on this web are too numerous to show the groups of products bought together.

From the Web output window, click the button that shows the web summary and controls. From the Controls tab, select the 'Size shows strong/normal/weak' radio button. Use the sliders to set 'Strong links above' to 100 and 'Weak links below' to 90. This gives you a visual look at which products sell together; based on the data set, there are 3 strong links (see the following screenshot).

Click Ok to close the Web output.

From the Modeling palette, drag and drop an Apriori node onto the stream canvas. Connect the Type node to the Apriori node.

Information: Product affinity analysis is performed using association detection algorithms, which provide rules describing which fields typically occur together.
IBM SPSS Modeler contains three different algorithms that perform association detection: Apriori, Carma and Sequence.
- Apriori has a slightly more efficient approach to association detection but has the limitation that it only accepts categorical fields as inputs. It also contains options that provide a choice of criterion measure used to guide rule generation.
- Carma, in contrast to Apriori, offers build settings for rule support (support for both antecedent and consequent) rather than antecedent support. Carma also allows rules with multiple consequents, or outcomes.
- Sequence is based on the Carma association rules algorithm, which uses an efficient two-pass method for finding sequences.

Notice the Apriori node is named 'No Targets'. This is because we need to specify which fields are to be predicted by the model.

Double-click the Type node to edit its properties. Select the fields from cardid to age and set their role to None. Set the role of the fields from fruitveg to confectionery to Both (meaning that the field can be either an input or an output of the resultant model). Click Ok to close the Type properties.

Double-click the Apriori node, now named 11 Fields. Click the Model tab, make sure Use partitioned data is selected and that the settings are as follows:

Information: The Use partitioned data option splits the data into separate Training, Testing (and Validation) samples. Because we're not using a Partition node, this option will be ignored. By default Apriori will produce rules that have a minimum support of 10% of the sample and a minimum confidence of 80%.

Click Run. An Apriori generated model node will appear in the Models palette and on the stream canvas.

Right-click the Apriori node in the Models palette of the Manager Window, then click Browse. The Apriori algorithm has found only three association rules, displayed by Confidence%. In addition to the consequent and antecedent, Support% and Confidence% are displayed. The first rule tells us that in 17% of the records (shopping visits), beer and frozen meals were purchased. Of this group, 85.882% also bought canned vegetables.

Additional measures can be displayed. Click the Show/hide criteria menu button. Click Instances. Repeat this step to display Lift. We now see that beer and frozen meals were purchased on 170 shopping trips. As there are 1000 records in the sample file, this is (170/1000)*100, or 17.0% of the total shopping trips (Support%). The Lift value is the expected return resulting from using the model or rule. Here it is the ratio of the confidence to the base rate of the consequent. The lift of 2.834 for the first rule means that the chances of buying canned vegetables almost triple when beer and frozen meals are purchased together. (A short worked computation of these measures follows the Information box below.)

Information: The findings we obtained algorithmically using the Apriori model look about the same as those obtained visually using the Web graph, that is, Canned Vegetables, Beer and Frozen Meals are often bought together. Based on the findings, our customers are divided into three categories:
1. Those who buy Fruits and Vegetables along with Fish, who might be called Healthy Eaters.
2. Those who buy Wine and Confectionery.
3. Those who buy Beer, Frozen Meals and Canned Vegetables.
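To make the Support%, Confidence% and Lift figures above concrete, here is a minimal worked computation in Python. Only the total record count, the 170 instances, the 85.882% confidence and the 2.834 lift are stated in the lab; the base rate of canned vegetables is derived from them here, so treat it as an approximation rather than a figure read from the data set.

```python
# Worked example of the association-rule measures for the rule:
#   beer & frozenmeal -> cannedveg
total_baskets = 1000        # records in BASKETS1n (stated in the lab)
antecedent_count = 170      # baskets containing both beer and frozen meals (Instances)
confidence = 0.85882        # reported Confidence%
lift = 2.834                # reported Lift

support = antecedent_count / total_baskets     # 170/1000 = 17.0%  (Support%)
rule_count = confidence * antecedent_count     # ~146 baskets also contain canned vegetables
consequent_rate = confidence / lift            # ~0.303: derived base rate of canned vegetables

print(f"Support    = {support:.1%}")
print(f"Confidence = {confidence:.1%}")
print(f"Lift       = {confidence / consequent_rate:.3f}")
```

In other words, roughly 30% of all baskets contain canned vegetables, but among baskets that already contain beer and frozen meals the figure rises to about 86%, which is where the near-tripling expressed by the lift comes from.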
Now that we have identified three groups of customers based on the types of products they buy, we would also like to know who these customers are, that is, their demographic profile.

3. Know our customers better

We will derive a flag for each group. That is, we want additional fields in our data that indicate whether or not a customer is part of a group.

1. From the Outputs tab in the Manager Window pane, display the Web of 11 Fields which you executed earlier.
2. Using the right mouse button, click the link between fruitveg and fish to highlight it.
3. Click Generate Derive Node For Link. This will add a Derive node onto the stream canvas, named T_T.
4. Edit the resulting Derive node to change the derive field name to Healthy.
5. Repeat those steps with the link from wine to confectionery, naming the resulting Derive node Wine_Chocs.
6. The remaining group is composed of three links. From the toolbar, click View > Explore Mode. Holding down the Shift key, select all three links in the cannedveg, beer and frozenmeal triangle.
7. From the web display menus, select Generate > Derive Node ("And").
8. Rename the resulting Derive node beer_beans_pizza.
9. To profile these customer groups, connect the existing Type node to these three Derive nodes in series.
10. Attach to each Derive node another Type node, as shown in the following screenshot.
11. For each of the new Type nodes, set the role of all fields to None, except for value, pmethod, sex, homeown, income and age, which should be set to Input.
12. Set the role of the relevant customer group field (for example, beer_beans_pizza) to Target.
13. From the Modeling palette, attach a C5.0 node (used for rule induction) to each of the Type nodes to build rule-based profiles of these flags.
14. Set the output type for each C5.0 node to Rule set.
15. Run each model.
16. Each resulting model contains a clear demographic profile for the group. Let's analyze the beer_beans_pizza model by right-clicking its nugget in the Models tab and selecting Browse. From the Model tab, expand Rules for T.
17. The above rule means that customers in the beer_beans_pizza group (beer, frozen meals and canned vegetables) are mostly male with a low income. The predictor importance shows which fields were considered the most important in building the rule-based profile.
18. Looking at the demographic that buys wine and confectionery, we see that their profile is mostly female with a higher income. Value is also among the predictors, meaning that the shopping basket amount is higher for those customers.

In the retail domain, such customer groupings might, for example, be used to target special offers to improve the response rates to direct mailings or marketing actions, or to customize the range of products (the assortment) stocked by a branch to match the demands of its demographic base.

Lab 3: IBM SPSS Text Analytics

Lab Objectives: Text mining in Modeler is accomplished via several nodes that are specialized for handling text.
We begin this lab by briefly discussing each node to familiarize you with them and with the text-mining environment in Modeler. We will then run through a simple text-mining example that will allow you to see how the Text Mining node operates, the extraction of concepts from text, the interactive workbench environment, and the grouping of these concepts into higher-level categories. This example will provide a practical foundation for the remainder of the course. It is based on the IBM SPSS Analytics training course 0A003FR.

Lab Duration: 60 min

1. Text Mining Nodes

Text Analytics is an add-on option that includes five nodes that read or process text. These nodes are contained in their own Text Mining palette.

Importing Text

Text can be stored in a database, Statistics file, or a spreadsheet and imported into Modeler for processing, just as with structured data. Two nodes allow you to import text from other common sources for text mining.
- The File List source node generates a list of document names as input to the text mining process. This node is used when the text resides in external documents rather than in one or more fields in a database or other file. The node outputs a single field with one record for each document or folder listed.
- The Web Feed source node reads in text from Web feeds, such as blogs or news feeds in RSS or HTML formats. The node outputs one or more fields for each record found in the feeds.

Translation

Modeler has the ability to translate text from a variety of languages into English through the use of Language Weaver.
- The Translate node is used to translate text, from either fields or documents, from several supported languages, such as Arabic, Chinese, and Russian, into English for modeling. This makes it possible to mine text in a language even if analysts are unable to read that language. The same functionality can be invoked from the text modeling nodes, but use of a separate Translate node makes it possible to cache and reuse a translation in multiple nodes.

Text Mining and Analysis

Modeler includes two nodes to mine text to extract concepts and to find links between them.
- The Text Mining node uses linguistic methods to extract key concepts from the text, allows you to create categories with these concepts and other data, and offers the ability to identify relationships and associations between concepts based on known patterns (called text link analysis). The node can be used to explore the text data, or to produce either a concept model or a category model, and then use that generated model information in subsequent modeling.
- The Text Link Analysis node extracts concepts and also identifies relationships between concepts based on known patterns within the text. Pattern extraction can be used to discover relationships between your concepts, as well as any opinions or qualifiers attached to these concepts.

Viewing Text within Files

The File Viewer node allows you to view text contained in documents from within Modeler.
- The File Viewer node provides you with direct access to your original text data in documents, however you use that data in a stream. The node can help you better understand the results from text extraction by providing access to the source text from which concepts were extracted, since it is otherwise inaccessible in the stream.
2. A Typical Text Mining Session

Conducting a text mining analysis comprises these essential steps:

Import: Text data are read into Modeler, either in a field, from documents, or from a web feed.

Extraction: An extractor engine automatically locates and collects the key terms from the text data. It also collects these terms into higher-level groups. Types are collections of similar terms, such as organizations, products, or positive statements. Patterns are combinations of terms and types that represent qualifiers, such as positive comments about an organization.

Categorization: Using linguistic methods, co-occurrence rules, or a standard term frequency approach, categories are created from the extracted results. The categories represent higher-level concepts that capture the chief ideas and key information in the text data.

Model Generation: Once categories are created, a text mining model node can be generated that allows text data to be scored, so that text data can be turned into structured information, typically flag fields indicating whether a concept is expressed in the text of a record, document, etc.

Editing Resources: Although the four steps above are sufficient to conduct text mining, they are usually not adequate on their own. Invariably, you will edit the dictionary resources supplied with Modeler, adding information and making modifications to ensure that the appropriate terms are extracted and categories created. This editing is done using the interactive workbench. Editing is an iterative process, and it can occur at any point after data have been imported.

We will follow all these steps in the following lab.

3. Import Data

We will launch Modeler and begin a new stream. For this example, we use a data file. We don't provide much background information here about this data except to note that the responses come from calls with problems or complaints to a call center for a telecommunications firm. The data were recorded over two months of calls and are a sample of all the calls received in that period. The data are contained in the text file:

C:\SPSS Workshop\Labs\Lab2 - Text Analytics\Source\Data\Astroserve0304.txt

- Add a Var. File node from the Sources palette to the stream canvas.
- Edit the Var. File node.
- Select the file Astroserve0304.txt from the C:\SPSS Workshop\Labs\Lab2 - Text Analytics\Source\Data\ directory.
- Click the Tab check box in the Delimiters area, and deselect the Comma check box.
- In the Quotes area, select Include as text for Single quotes and select Discard for Double quotes.

How quotes are handled can be critical when text is read. Also, since text data often contain commas, a comma is normally not used as the delimiter character.

To preview the data, click on the Preview button.

There is an ID field for each customer call (Query_ID), day and month fields, and the text field itself (query). The text can be quite lengthy, and the entries are of varied length. Note that there are spelling errors in the text (lightening and recieved), abbreviations (cust and i/net), different date formats, and special terms specific to this organization or industry (ntu and DTP).
Sometimes the linguistic resources will automatically handle these situations, but often some editing of the dictionary resources will be necessary.

Next we'll add a text mining modeling node to the stream.
- Close the Table window.
- Add a Text Mining node from the Text Mining palette to the stream.
- Connect the Text Mining node to the Var. File node.
- Edit the Text Mining node.

Information: You may have noticed a delay when adding the Text Mining node to the stream. The first time you add a Text Mining node to the stream in a Modeler session, the software has to install the resources that will be available to the node when it is executed. This makes the node "heavy" and requires more loading time.

Data can be stored in a separate text field, as in the current data file, or in separate documents. When in documents, the path name, or location, of the documents is specified along with other settings appropriate to the text and its format.

- Click the Field Chooser button and select query as the Text field.
- Click the Model tab.

Text mining can be executed in two different modes from this node. The default is Build interactively (category model nugget), which will open a special interactive environment in which you can perform exploratory analysis (such as clustering and text link analysis), edit the linguistic resources, create categories, either automatically or manually, or otherwise fine-tune the text mining results. Alternatively, the Generate directly (concept model nugget) selection will build a model automatically, using the node settings and the linguistic resources that are selected (in the Resource template area). We will use the interactive workbench in this example.

- Click on the Exploring Text link analysis (TLA) results option.

4. Selecting a Set of Categories and Resources

In the final step in the New Project Wizard, you can select a text analysis package (TAP) that contains the linguistic resources and categories that best fit your text data. A text analysis package (TAP) includes the linguistic resources -the libraries, dictionaries, and related resources- which are used to extract key concepts from the text. Also included are one or more category sets, each of which contains an enhanced code frame.

In Text Analytics for Modeler there are several pre-built TAP files for English language text. Each TAP file shipped with this product is fine-tuned for a specific type of survey, such as employee, product, and customer satisfaction surveys. You can also save your own custom TAP created from a combination of shipped resources and the changes and additions that you make. You can then choose this custom TAP for a new project.

Even if you don't have survey data that exactly fits these types, the TAPs may be helpful. You just need to broaden the view of "product" or "customer" when thinking about your data. For example, course evaluation surveys are essentially product evaluation surveys where the product is the course.

- Click on the Text analysis package option under Copy Resources From.
- Click on the Load button.
- Click on Customer_Satisfaction.tap.
- Click on the Mixed Opinions option under Category Sets and click on the Load button.

We will make one change to the Expert settings.
- Click the Expert tab.
- Click the Accommodate spelling for a minimum root character limit of check box.

The Expert tab contains options to control how text is extracted, and many of the choices deal with problems in text, such as spelling and punctuation errors. The option to try to fix spelling errors is turned off by default, but we'd like to use it for this text because text entry was rather haphazard and done on the fly by customer service representatives.

We have now made all the necessary selections to execute the Text Mining node:
- Click Execute.

When the node is executed, two things happen. The interactive workbench environment opens, and text extraction commences. There is a text extraction progress window that lists the steps and progress in the text extraction process.

Information: The first time a Text Mining node executes in a Modeler session, it will take longer than subsequent executions, even with the same settings. When it first executes, the template to be used for text mining has to be loaded.

When extraction is complete, the extracted results are displayed in the interactive workbench window.

- Click on the View button in the upper right hand corner of the interactive workbench and select Categories and Concepts to see the extracted concepts and categories.

The interactive workbench window has four different views for different types of analysis or for editing the dictionary resources. The default view is Categories and Concepts, and in this view there are four panes:

Extracted Results pane: Located in the lower left corner, this pane is the area in which you perform an extraction and where the results can be edited and refined. The extracted concepts, or terms, are listed, along with their type and frequency in the text.

Data pane: Located in the lower right corner, this pane is initially blank and is used to present the text data corresponding to selections in the other panes. The text is not displayed automatically but will be displayed when the Display button in the Extracted Results or Categories pane is clicked.

Categories pane: Located in the upper left corner, this area presents a table of the categories that have been created along with their frequency in the text. The categories can be managed and edited from this pane.

Visualization pane: Located in the upper right corner, this pane provides various graphical representations of categorization (and will provide other types of visualization for other views). The graphs include a bar chart of categories, a web graph showing category relationships, and a table displaying the same information in a more traditional format. The visual display corresponds to what is selected in the other panes.

5. Extracted Results Pane

The default display in this pane shows the extracted concepts. These are single words, such as complaint, or compound words, such as phone service. All words are displayed in lower case. The concepts are ordered by their document frequency (Docs column), which is the number of documents/records in which they appear. Thus, the concept service appears in 1,221 records, which is 23% of the total number of customer calls.
A second column listing frequency is headed "Global" and represents the number of times a concept appears in the entire set of documents or records. If a concept appears three times in one record, it is counted three times for the Global frequency. Thus, the concept service appears 1,745 times, which is 2% of the total number of occurrences of all concepts.

When concepts are extracted, they are assigned a Type to help group similar concepts, and they are color coded according to their type. We see that all the concepts visible have a type of Unknown and are colored blue. There are several built-in types delivered with Text Mining for Modeler, such as Location, Product, Person, Positive (qualifiers) and Negative (qualifiers).

The most frequent concepts are cust and customer, which is expected for call center data from customers. The first visible concepts that may be interesting for text mining are probably service, complaint, and fault.

There are two views in the Extracted Results pane, Concepts and Type. Let's look at the types next.

- Click the View selection button and select Type.

The display is similar to that for concepts, with frequencies by both documents and globally. By far the most frequent type is Unknown, and you can expect this to be true the first time you extract text data using the default linguistic resources. Even so, Modeler has found 2,351 occurrences of dates, 1,814 occurrences of persons, and 5,653 occurrences of budget-related items in the text, among other types.

6. Data Pane

If we wonder about the specific terms grouped under a type, we can view the original text data in the Data pane. Let's do that for the Organization type.

- Click the <Organization> type to select it.
- Click the Display button.

The Data pane displays one row per document or record corresponding to your selection in another pane (in this instance, the Extracted Results pane). If the text data is relatively short in length, the text field displays most or all of the text data. But the call center records can be quite lengthy, and then the text field column shows a short portion of the text and there is a Text Preview pane to the right that displays more of the text. This may not be visible immediately. To see it:

- Maximize the interactive workbench window.
- Click on the second text record to select it.
- Move the pane divider between the two columns so both are visible.

The word or words that were extracted and placed in the Organization type are highlighted in yellow. For the second text entry, the firm name nokia was extracted from this call center entry.

In the text data, all words in color have been extracted, so you can see that a large portion of the text in this record was extracted. All words in black were not extracted. The words that are not extracted are often connectors, verbs or pronouns. These words are used during the extraction to make sense of the text but are not terms in themselves. Some words that could be verbs, such as repair, have been extracted, but in the context of the text, repair is a noun, not a verb. This is a result of the natural language processing used by Modeler.

- Click the dropdown list in the Extracted Results pane and select Concept.
The Extracted Results can be sorted by the contents of any column, in alphabetic order or by frequency. To sort, repeatedly click in a column header to select the sort order you prefer. We want to order the extracted concepts in reverse alphabetical order.

- Click once in the Concept column header until you see this display.
- Scroll down until you see the word mobile in the entries.

There are many separate concepts that begin with the word mobile. In the extraction process, although these terms are related, as they all refer to mobile (cell phone) use or service, they are kept separate. This is to allow maximum flexibility, but also because text extraction is, in reality, a two-step process. First the program must find the meaningful information in the text. Second, the program can then group together related concepts (the process of categorization). We'll see how Modeler handled these concepts when we examine the categorization results. Before doing so, we briefly review some of the resources that are used in extracting and typing terms.

7. Resource Editor

The Resource Editor in the interactive workbench allows you to edit and refine the linguistic resources to tune the dictionaries for the specific type of text you are mining.

- Click the View button in the upper right hand corner of the interactive workbench and select Resource Editor.
- If necessary, click Customer Satisfaction Opinions in the library tree in the upper left corner.

The Resource Editor window displays four panes. The top left pane shows the libraries contained within the Customer Satisfaction Opinions Text Analysis Package (TAP). There is always a Local library in an interactive workbench session by default, although it is empty when you begin.

- Click on the Core Library (English).

The Type pane is located in the upper center section of the window and displays the types and associated terms. We see that great britain and los angeles are associated with the Location type and are in the Core library. You can scroll through the type window to see the various terms and types (notice that all text is in lower case). It probably seems odd that only a few locations are listed, but this is because the vast majority of the supplied type information is in compiled resources that are not visible to the user. Thus Modeler is able to successfully recognize thousands of geographical locations from the compiled Location type resources. You only need to add other locations if they are not recognized in extraction.

The lower pane shows synonyms that are applied to handle words with the same meaning, plural forms, and spelling variants. The Target term will be displayed and used in the extracted results, and any of the Synonyms found will be replaced by the target. Invariably, you will make changes to the synonyms to tune the results for your specific text.

The fourth pane, on the right hand side, is the Excluded pane and lists words that are not to be extracted, usually because they are not meaningful or they add clutter to the results. In the call center data, the words cust and customer fall into this category, and we may want to consider excluding them from extraction.
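To give a feel for what the Target/Synonym and Excluded resources accomplish, here is a minimal conceptual sketch in Python. It is not how Modeler implements its linguistic resources; the synonym map and most of the excluded terms below are invented for illustration (only cust and customer come from the lab text).

```python
# Conceptual sketch of target-term substitution and exclusion (not Modeler's internals).
synonyms = {
    "recieved": "received",        # a spelling variant mapped to its target term (hypothetical entry)
    "cell phone": "mobile phone",  # a synonym mapped to its target term (hypothetical entry)
}
excluded = {"cust", "customer"}    # terms we might choose not to extract at all

def normalize(concepts):
    """Replace synonyms with their target term, then drop excluded terms."""
    cleaned = []
    for concept in concepts:
        concept = synonyms.get(concept, concept)
        if concept not in excluded:
            cleaned.append(concept)
    return cleaned

print(normalize(["cust", "cell phone", "recieved", "fault"]))
# -> ['mobile phone', 'received', 'fault']
```

Editing the real resources in the Resource Editor plays the same role: it decides which surface forms collapse into one extracted concept and which terms never appear in the results at all.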
8. Categorization

Once you have extracted terms, the next step is to categorize the responses using various automatic methods. Categories are higher-level concepts that represent higher-level ideas and information in the text. They can also represent all those terms that use certain words, such as the terms using mobile. Categories can also represent a single concept, depending on the choice of method and settings. Some concepts are important enough, or unusual enough, that they represent a distinct category. Category names are based on the key concept(s) extracted from the set of text.

Because we use the Customer Satisfaction Opinions TAP, categories were already included.

The Docs column now has an icon with two arrows in each cell for each category rather than a document count. This is because the data must be scored to determine which records contain which categories.

- Click the Score button.

Another progress window appears (not shown here). After scoring the results, we can see the number of records for each category listed. The category labeled Uncategorized lists how many responses are uncategorized (here 1150). The category names have the type of the key concept added as a prefix. Hence, the category "Pos: Product: Design/Feature" is a category created from concepts that capture comments from customers who indicate that they are generally satisfied with the design and features of the various products they have.

Let's look at one of the categories in more detail.

- Double-click on the category Neg: Product: Functioning.

The symbol by each component of this category indicates that each of them is a rule. Rules themselves consist of two or more concepts, types, or patterns. For example, there is a rule linking clarification of issues for the customer and non-positive comments:

[<*clarification* & !(<Positive>)]

The exclamation point is the not symbol, so this rule says to place responses in this category if they mention clarification of an issue and a non-positive comment. If you click on this rule and then click the Display button, you can see some of the responses associated with it in the Data pane. The relevant text used for categorization will be highlighted.

Another category that was automatically created was Neg: Pricing and Billing. This category could potentially be ideal for identifying customers who have concerns about the company's billing practices. In addition, we can use the Visualization pane to get an indication of what other categories commonly occur with this category. A Category Bar chart will provide exactly this information.

- Click the Neg: Pricing and Billing category in the categories list.
- Click the Category Bar tab in the Visualization pane (if necessary).
- Click the Display button in the Categories pane.

After a few moments, a bar chart appears in the Visualization pane. You may need to adjust column sizes so that the full Docs column count is visible.

This chart lists the overlap between categories for the 528 records that contain terms in the Neg: Pricing and Billing category. We see that the Neg: Product: Functioning category occurs in 181 of those records, or 34.3%.
This type of information can be very helpful in either a) understanding a category or b) thinking about other categories that might be created (by grouping two or more categories that occur together frequently).

9. Generating a Text Mining Model

The last step in the text mining modeling process is the creation of a generated text mining model that can be used to score (categorize) data. After scoring data, the text information can be combined with structured data to create a variety of models. We will generate a model, browse it, and then add it to the stream to review its output.

Models are created from the Generate menu in the interactive workbench.

- Click Generate…Generate Model.
- Click File…Close to close the interactive workbench.

Modeler provides you with the option to save the interactive session so you can return to it in its current state, exit the session without saving, or close the window but keep the session available. For this example, we'll simply exit.

- Click the Exit button.

You will be returned to the Modeler environment.

- Right-click on the generated model named query in the Models manager area.
- Click Browse.

The model scores the data for the categories, so browsing a model means seeing a list of the categories created along with their descriptors. To see the detailed information, click on a category.

- Click on the Pos: General Satisfaction category.

We see each concept (descriptor) in this category, the concept's type, and another column listing details about other terms that were added under a specific concept.

To see the effect of a text mining model, we'll add the model to the stream, connect it to the data source, and run the data into a Table node.

- Close the Model Browsing window.
- Click on the generated model query and drag it into the stream near the data source.
- Connect the data source node to the generated model.
- Edit the generated model.
- To see the results, click on the Preview button.
- Scroll to the right in the Preview output, past the query field.

Information: This could be done with a Table node and executing that path within the stream, but for large datasets the preview is a much quicker option. Of course, to see all of the results, a Table node would be needed.

New flag fields have been created, with values of T or F, for each category. A record is coded T (true) if that category is contained in the text for that record. A record is coded F (false) if the category is not in the text. These new fields can now be used to generate reports, further investigate the relationships between the categories, study the relationship between other fields, such as demographic information, and the categories, or develop models that use the category fields as inputs. You could even use a category field as an outcome and attempt to predict what factors lead to particular comments in customer calls.

- Close the Preview window.

We've now seen a complete, albeit simple, example of text mining modeling. Although many details and complications went unmentioned, and we did no editing of the dictionaries, this example has presented the key steps in text mining.

Congratulations! You've just finished the workshop.