| Oracle® Data Mining Application Developer's Guide 11g Release 2 (11.2) Part Number E12218-03 |
|
|
View PDF |
Oracle Data Mining supports the mining of data sets that have one or more text columns. These columns must undergo a special preprocessing step whereby text tokens known as terms are extracted and stored in a nested column. The transformed text can then be used as any other attribute in the building, testing, and scoring of models. Any algorithm that supports nested data can be used for text mining. (See Table 3-2 in Chapter 3.)
This chapter explains how to use PL/SQL to prepare a column of text for mining.
Note:
Oracle Data Mining includes sample programs that illustrate text transformation and text mining in both PL/SQL and Java. Refer to Oracle Data Mining Administrator's Guide for information about the Oracle Data Mining sample programs.See Also:
"Using Text Transformation" for information about the text transformation interface n the Java API
Oracle Data Mining Concepts for information about text mining
This chapter contains the following sections.
Oracle Data Mining uses specialized Oracle Text routines to preprocess text data. Oracle Text is a technology within the database for building text querying and classification applications.
Oracle Text provides the following facilities for the Oracle Data Mining term extraction process:
SVM_CLASSIFIER, defined in the CTX_DLL Oracle Text PL/SQL package, specifies an index preference for Oracle Data Mining term extraction.
The CTXSYS.DRVODM Oracle Text PL/SQL package defines the table functions, FEATURE_PREP and FEATURE_EXPLAIN, which generate intermediate and final tables of text terms for Oracle Data Mining.
In PL/SQL, the term extraction process requires the use of these Oracle Text facilities. Java developers can use the OraTextTransform interface, which presents the Oracle Text term extraction capability within the context of a Java environment. (See "Using Text Transformation".)
See Also:
Oracle Text Application Developer's Guide and Oracle Text Reference for information on Oracle TextNote:
The Oracle Text facilities for Oracle Data Mining are documented in this chapter. They are not documented in the Oracle Text manuals.Note on Terminology:
Text terms are also known as features. In text mining, a feature is a word or group of words extracted from a text document. Features (terms) are the fundamental unit of text that can be manipulated and analyzed.The feature extraction mining function (NMF algorithm in Oracle Data Mining) and text mining transformation both perform a kind of feature extraction.
A feature extraction model creates features that represent fundamental qualities of multiple attributes. The model operates on the features instead of the original attributes.
Text transformation changes a text document into a collection of features, each one representing a fundamental characteristic of the document. The model operates on the text features instead of the original document.
A good place to start in learning the term extraction process is with the sample programs. The following sample programs contain term extraction code for text mining:
dmsh.sql — Prepares the build, test, and scoring data for the sample programs, including the text mining programs. dmsh.sql creates views for data mining and tables and indexes for text mining.
dmtxtfe.sql — Using the indexed text column created by dmsh.sql, creates a nested column suitable for text mining.
The dmtxtfe.sql program is a sample term extractor. It contains extensive comments that explain the code in a step-by-step fashion. You can expand this program into a complete term extraction solution by adding index creation and the preparation of test and scoring data (as in dmsh.sql).
Two sample PL/SQL programs use the data prepared by dmsh.sql to mine text.
dmtxtnmf.sql creates and applies a model that uses Non-Negative Matrix Factorization.
dmtxtsvm.sql creates and applies a model that uses SVM classification.
Both these programs mine a table of customer data that includes a nested column of text data called COMMENTS. The COMMENTS column has been pre-processed by dmsh.sql. The models created by these programs are referenced in the following example from a Linux system.
-- Run the programs to create the models SQL> @ $ORACLE_HOME/rdbms/demo/dmtxtnmf.sql SQL> @ $ORACLE_HOME/rdbms/demo/dmtxtsvm.sql -- List the models created by the programs SQL> SELECT model_name, mining_function, algorithm FROM user_mining_models; MODEL_NAME MINING_FUNCTION ALGORITHM ------------------------ --------------------------- ------------------------ T_SVM_CLAS_SAMPLE CLASSIFICATION SUPPORT_VECTOR_MACHINES T_NMF_SAMPLE FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR
See Also:
Oracle Data Mining Administrator's Guide. This manual provides instructions for obtaining and running the sample programs. It includes information about the build, training, and scoring data used by these programs.The pre-processing steps for text mining create nested table columns of type DM_NESTED_NUMERICALS from columns of type VARCHAR2 or CLOB. Each row of the nested table specifies an attribute name and a value. The DM_NESTED_NUMERICALS type defines the following columns.
attribute_name VARCHAR2(4000) value NUMBER)
The term extraction process treats the text in each row of the original table as a separate document. Each document is transformed to a set of terms that have a numeric value and a text label. Within the nested table column, the attribute_name column holds the text and the value column holds the numeric value of the term, which is derived using the term frequency in the document and in the document collection (other rows).
For example, the following query returns various attributes of customer 102998, including a text column of comments. The text column has not been transformed.
SQL> select cust_id, cust_gender, cust_income_level, affinity_card, comments
from mining_build_text
where cust_id = 102998;
CUST_ID C CUST_INCOME_LEVEL AFFINITY_CARD COMMENTS
------- -- -------------------- ------------- --------------------------------
102998 M J: 190,000 - 249,999 1 I wanted to write you to let you
know that I've purchased several
items at your store recently and
have been very satisfied with my
purchases. Keep up the good work.
The following query returns the same attributes of customer 102998, but the text in the comments column has been transformed. The query extracts the ATTRIBUTE_NAME and VALUE columns from the nested table that holds the transformed text.
SQL> select b.cust_id, b.cust_gender, b.cust_income_level, b.affinity_card, n.*
from mining_build_nested_text b,
table(b.comments) n
where b.cust_id = 102998
order by n.attribute_name;
CUST_ID C CUST_INCOME_LEVEL AFFINITY_CARD ATTRIBUTE_NAME VALUE
------- -- ------------------- ------------- -------------- --------
102998 M J: 190,000 - 249,999 1 GOOD .26894
102998 M J: 190,000 - 249,999 1 ITEMS 158062
102998 M J: 190,000 - 249,999 1 KEEP 238765
102998 M J: 190,000 - 249,999 1 KNOW .2006
102998 M J: 190,000 - 249,999 1 LET 299856
102998 M J: 190,000 - 249,999 1 PURCHASED 142743
102998 M J: 190,000 - 249,999 1 PURCHASES 173146
102998 M J: 190,000 - 249,999 1 RECENTLY .195223
102998 M J: 190,000 - 249,999 1 SATISFIED .355851
102998 M J: 190,000 - 249,999 1 SEVERAL .355851
102998 M J: 190,000 - 249,999 1 STORE .0712537
102998 M J: 190,000 - 249,999 1 UP .159838
102998 M J: 190,000 - 249,999 1 WANTED .355851
102998 M J: 190,000 - 249,999 1 WORK .299856
102998 M J: 190,000 - 249,999 1 WRITE .355851
The ATTRIBUTE_NAME column holds an item of text from the original comments column. The VALUE column holds the term value. Note that not all words from the original comments column are extracted as terms. For example, the articles the and to are not included.
The steps in the term extraction process are summarized in this section. Further details and specific syntax requirements are explained later in this chapter.
First transform the text in the build data. During this process you will generate the text term definitions, which you will reuse for the test and apply data. Perform the following steps:
Create an index on the text column in the build table.
Define a table to hold the categories specified by the SVM_CLASSIFIER index.
Use the FEATURE_PREP table function to create the term definitions and populate an intermediate terms table.
Use the FEATURE_EXPLAIN table function to populate the final terms table.
Replicate the columns of the original build table (using a view or another table), replacing the text column with a nested table column. Load the terms from the final terms table into the nested table column.
The test and apply data must undergo the same pre-processing as the build data. To transform the test and apply data, you will reuse the term definitions generated for the build data. Perform the following steps:
Create an index on the text column in the test or apply table.
Use the FEATURE_PREP table function to populate an intermediate terms table. Use the term definitions previously generated for the build data.
Use the FEATURE_EXPLAIN table function to populate the final terms table.
Replicate the columns of the original test or apply table, replacing the text column with a nested table column. Load the terms from the final terms table into the nested table column.
Oracle Text processing requires a text index. Oracle Text supports several types of indexes for querying, cataloging, and classifying text documents. The Oracle Data Mining term extraction process requires a CONTEXT index for text querying.
You must create an index for each text column to be transformed. Use the following syntax to create the index.
SQL>CREATE INDEX index_name ON table_name(column_name) INDEXTYPE IS ctxsys.context PARAMETERS ('nopopulate');
Note:
This statement creates a basicCONTEXT index. You can further define the characteristics of the index by specifying additional arguments to the CREATE INDEX statement. Refer to Oracle Text Reference for details.Oracle Text supports index preferences for overriding the default characteristics of an index. The CREATE_PREFERENCE procedure in the Oracle Text package CTX_DDL creates a preference with the name and type that you specify. The SVM_CLASSIFIER preference type defines the characteristics of an index for Oracle Data Mining.
You must create an index preference when you prepare the build data. It will be reused when you prepare the test and apply data. Use the following syntax to create the index preference.
SQL>EXECUTE ctx_ddl.create_preference('preference_name', 'SVM_CLASSIFIER');
The SVM_CLASSIFIER index preference uses a predefined table with two numeric columns: ID and CAT. ID holds the case ID; CAT holds the category. The category table is used for internal processing. You must create the category table using the following syntax.
SQL>CREATE TABLE category_table_name(id NUMBER, cat NUMBER);
The FEATURE_PREP table function in the CTXSYS.DRVODM Oracle Text package extracts terms from a text column using an index preference of type SVM_CLASSIFIER. FEATURE_PREP creates a table of term definitions from the build data and reuses these definitions for the test and apply data.
FEATURE_PREP returns an intermediate terms table.
FEATURE_PREP is an over-loaded function that accepts two different sets of arguments. You will specify one set of arguments for the build data and another set for the test and apply data.
--- syntax for build data ---
CTXSYS.DRVODM.FEATURE_PREP (
text_index IN VARCHAR2,
case_id IN VARCHAR2,
category_tbl IN VARCHAR2,
category_tbl_id_col IN VARCHAR2,
category_tbl_cat_col IN VARCHAR2,
feature_definition_tbl IN VARCHAR2,
index_preference IN VARCHAR2)
RETURN DRVODM;
--- syntax for test/apply data ---
CTXSYS.DRVODM.FEATURE_PREP (
text_index IN VARCHAR2,
case_id IN VARCHAR2,
feature_definition_tbl IN VARCHAR2,
RETURN DRVODM;
FEATURE_PREP returns the following columns. The SEQUENCE_ID column holds the case ID; the ATTRIBUTE_ID column holds the term ID.
Name NULL? Type ---------------------- ------- ------ SEQUENCE_ID NUMBER ATTRIBUTE_ID NUMBER VALUE NUMBER
FEATURE_PREP accepts the arguments described in Table 4-1.
Table 4-1 FEATURE_PREP Table Function Arguments
| Argument Name | Data Type | |
|---|---|---|
|
|
|
Name of the index on the text column in the build, test, or apply table. |
|
|
|
Name of the case ID column in the build, test, or apply table. |
|
|
|
Name of the table used by the Specify this argument only for build data. |
|
|
|
Specify 'id'. This is the name of the Specify this argument only for build data. |
|
|
|
Specify 'cat'. This is the name of the Specify this argument only for build data. |
|
|
|
Name of the term definition table created by Name Null? Type --------------------------------- CAT_ID NUMBER TYPE NUMBER RULE BLOB |
|
|
|
Name of the Specify this argument only for build data. |
The following example creates an intermediate terms table called txt_term_out. The FEATURE_PREP table function extracts terms from a text column with an index called build_text_idx. The text column is in a build table with a case ID column called cust_id. The index preference txt_pref is applied to the index using the id and cat columns in the table cat_tbl. FEATURE_PREP creates a table of term definitions called txt_pref_terms.
CREATE TABLE txt_term_out AS
SELECT *
FROM TABLE(ctxsys.drvodm.feature_prep (
'build_text_idx',
'cust_id',
'cat_tbl',
'id',
'cat',
'txt_pref_terms',
'txt_pref'));
The FEATURE_EXPLAIN table function in the CTXSYS.DRVODM Oracle Text package extracts the term values from the definitions created by FEATURE_PREP and appends the associated word to each value.
FEATURE_EXPLAIN returns the final terms table.
The calling syntax of FEATURE_EXPLAIN is described as follows.
CTXSYS.DRVODM.FEATURE_EXPLAIN (
feature_definition_tbl IN VARCHAR2,
RETURN DRVODM;
FEATURE_EXPLAIN returns the following columns.
Name Type --------------- --------------- text VARCHAR2(160) type NUMBER(3) ID NUMBER score NUMBER
FEATURE_EXPLAIN accepts a single argument: the terms definition table created by FEATURE_PREP.
The following example creates a final terms table called txt_final_terms using the intermediate terms table txt_term_out. The FEATURE_EXPLAIN table function returns the terms specified in the terms definition table txt_pref_terms.
SQL> create table txt_final_terms as
select A.sequence_id, B.text, A.value
FROM txt_term_out A,
TABLE(ctxsys.drvodm.feature_explain(
'txt_pref_terms')) B
WHERE A.attribute_id = B.id;
Use the final terms table to populate a nested table column of type DM_NESTED_NUMERICALS.
The following example creates the table mining_build_nested_text. (Alternatively, you could create a view.) The table has a case ID column of customer IDs and three customer attribute columns: age, education, and occupation. It also includes a comments column of type DM_NESTED_NUMERICALS created from the terms table txt_final_terms.
SQL> CREATE TABLE mining_build_nested_text
NESTED TABLE comments store AS build_comments
AS
SELECT non_text.cust_id,
non_text.age,
non_text.education,
non_text.occupation,
txt.comments
FROM
mining_build_text non_text,
( SELECT features.sequence_id,
cast(COLLECT(dm_nested_numerical(features.text,features.value))
as dm_nested_numericals) comments
FROM txt_final_terms features
group by features.sequence_id) txt
WHERE non_text.cust_id = txt.sequence_id(+);
In the following example, a text column in MINING_BUILD_TEXT is transformed to a nested table column in MINING_BUILD_NESTED_TEXT. The same text column in MINING_APPLY_TEXT is transformed to a nested table column in MINING_APPLY_NESTED_TEXT.
Both MINING_BUILD_TEXT and MINING_APPLY_TEXT have the following columns.
Name Null? Type --------------------------------- -------- --------------------------- CUST_ID NOT NULL NUMBER AGE NUMBER EDUCATION VARCHAR2(21) OCCUPATION VARCHAR2(21) COMMENTS VARCHAR2(4000)
The following statements create the indexes.
SQL> create index build_text_idx on mining_build_text (comments)
indextype is ctxsys.context parameters ('nopopulate');
SQL> create index apply_text_idx ON mining_apply_text (comments)
indextype is ctxsys.context parameters ('nopopulate');
The following statements create the index preference and its table.
SQL> execute ctx_ddl.create_preference('idx_pref', 'SVM_CLASSIFIER');
SQL> create table idx_pref_cat (id number, cat number);
The following statement returns the intermediate terms in the table BUILD_TERMS_OUT. It also creates the table FEATURE_DEFS and populates it with the term definitions.
SQL> create table build_terms_out as
select * from
table (ctxsys.drvodm.feature_prep
('build_text_idx',
'cust_id',
'idx_pref_cat',
'id',
'cat',
'feature_defs',
'idx_pref'));
The following statement returns the final terms in the table BUILD_EXPLAIN_OUT.
SQL> create table build_explain_out as
select a.sequence_id,
b.text,
a.value
from build_terms_out a,
table (ctxsys.drvodm.feature_explain('feature_defs')) b
where a.attribute_id = b.id;
The following statement creates the table MINING_BUILD_NESTED_TEXT. This table contains the non-text attributes from the original build table and a nested table of comments. This table can be used to build a model.
SQL> create table mining_build_nested_text
nested table comments store as build_comments
as select non_text.cust_id,
non_text.age,
non_text.education,
non_text.occupation,
txt.comments
from mining_build_text non_text,
(select features.sequence_id,
cast(collect(dm_nested_numerical(features.text,features.value))
as dm_nested_numericals) comments
from build_explain_out features
group by features.sequence_id) txt
where non_text.cust_id = txt.sequence_id(+);
The following statement creates the intermediate terms table for the comments column in the apply table, MINING_APPLY_TEXT. It uses the term definitions in the FEATURE_DEFS table, which was created during the pre-processing of the comments column in MINING_BUILD_TEXT.
SQL> create table apply_terms_out as
select * from
table (ctxsys.drvodm.feature_prep
('build_text_idx',
'cust_id',
'feature_defs'));
The following statement creates the final terms table for apply.
SQL> create table apply_explain_out as
select a.sequence_id,
b.text,
a.value
from apply_terms_out a,
table (ctxsys.drvodm.feature_explain('feature_defs')) b
where a.attribute_id = b.id;
The following statement creates the table MINING_APPLY_NESTED_TEXT. This table contains the non-text attributes from the original apply table and a nested table of comments. This table can be used to apply the model.
SQL> create table mining_apply_nested_text
nested table comments store as apply_comments
as select non_text.cust_id,
non_text.age,
non_text.education,
non_text.occupation,
txt.comments
from mining_apply_text non_text,
(select features.sequence_id,
cast(collect(dm_nested_numerical(features.text,features.value))
as dm_nested_numericals) comments
from apply_explain_out features
group by features.sequence_id) txt
where non_text.cust_id = txt.sequence_id(+);