Data Mining Techniques
Data mining techniques specify the types of data patterns that a data mining procedure can discover. Researchers categorize these techniques into two groups: descriptive and predictive. Descriptive techniques focus on finding human-interpretable patterns that describe the data; they mainly outline the characteristic features of the database records. Examples include data summarization, association rule mining, and clustering analysis. Predictive techniques draw inferences from the present data in order to make predictions; they use some attributes in the database to predict unknown or future values of other attributes of interest. Examples include regression analysis, classification, and prediction. This section describes these data mining techniques and the types of data patterns they can discover.
1.3.1 Association Rule Mining
Association rule mining (also known as frequent pattern mining) is an important data mining technique. Frequent patterns are patterns that occur regularly in a data set. For instance, a set of items such as bread, butter, and milk that frequently appear together in transaction records is called a frequent itemset. Discovering such frequent patterns plays a vital role in mining relationships among data, and the resulting patterns can also support classification, clustering analysis, and other data mining tasks. Association rule mining is a descriptive data mining technique.
Association rule mining (ARM) denotes the procedure of discovering interesting and unexpected rules from large databases. The field provides a very general model for finding relationships among the items of a database. An association rule is an if-then rule supported by the data. Initially, association rule mining algorithms were used to solve the market-basket problem: given a set of items and a large collection of transaction records, the goal was to find associations among the items that hold across many transactions. A typical association rule resulting from such a study could be "60 percent of all consumers who purchase a personal computer also buy antivirus software", which reveals very valuable information. Such analysis can provide new insight into customer behavior, and thereby lead to higher profits through better customer relations, customer retention, and better product placement.
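The market-basket idea above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the Apriori algorithm itself (Apriori would prune the candidate itemsets); the transactions, support threshold, and confidence threshold are all illustrative assumptions.

```python
from itertools import combinations

# Toy market-basket transactions (illustrative data, not from the text).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]
min_support = 0.6      # itemset must appear in at least 60% of transactions
min_confidence = 0.7   # rule X -> Y must hold in at least 70% of cases of X

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Enumerate all candidate itemsets (feasible only for toy data).
items = sorted(set().union(*transactions))
frequent = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if support(set(c)) >= min_support
]

# Derive rules X -> Y with confidence = support(X union Y) / support(X).
for itemset in frequent:
    for k in range(1, len(itemset)):
        for lhs in combinations(itemset, k):
            lhs, rhs = frozenset(lhs), itemset - frozenset(lhs)
            conf = support(itemset) / support(lhs)
            if conf >= min_confidence:
                print(set(lhs), "->", set(rhs), f"conf={conf:.2f}")
```

With this data, {bread, butter} is frequent (support 0.6) and yields rules such as bread -> butter with confidence 0.75, while the triple {bread, butter, milk} falls below the support threshold.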
1.3.2 Classification

Data classification is the method of determining a classifier, or model, that describes and discriminates several data classes from one another. The classification procedure first applies some preprocessing tasks (data cleaning, data selection, data transformation, etc.) to the original data. It then divides the preprocessed data set into two sections, namely the training data set and the test data set. These data sets should be independent of each other to avoid bias. A classification technique is alternatively known as a classifier.
Classification consists of two steps. The first step develops a classification model (i.e., the classifier) representing a predefined set of classes. This is the training phase, in which the classification technique constructs the model by learning from a given training data set together with its associated class label attribute; classification is therefore a form of supervised learning. In the second step, called the testing phase, the model is used for prediction. This step estimates the accuracy of the derived model on the test data set. Classification is a predictive data mining task.
The classification procedure is applied to huge information repositories to build models that identify diverse data classes. This kind of analysis can provide deep insight and a better understanding of large-scale databases. The resulting model is based on the analysis of a training data set and can take several forms, such as mathematical formulas, simple if-then rules, artificial neural networks, or decision trees. Software applications built on classification techniques analyze large databases and derive meaningful classifications and patterns from them for scientific, industrial, and commercial purposes.
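The two-phase workflow described above (train on labelled data, then estimate accuracy on an independent test set) can be sketched with a hypothetical 1-nearest-neighbour classifier; the points and labels are illustrative assumptions, not data from the text.

```python
# A minimal sketch of the train/test classification workflow using
# 1-nearest-neighbour on toy two-dimensional points.
def nearest_neighbor_predict(training, point):
    """Predict the class label of `point` from labelled training pairs."""
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    _, label = min(training, key=lambda pair: dist2(pair[0], point))
    return label

# Training phase: points with known class labels (supervised learning).
train_set = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
             ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]

# Testing phase: estimate accuracy on an independent test set.
test_set = [((0.9, 1.1), "A"), ((5.1, 4.9), "B")]
correct = sum(nearest_neighbor_predict(train_set, p) == y
              for p, y in test_set)
accuracy = correct / len(test_set)
print(f"test accuracy = {accuracy:.2f}")
```

On this toy data both test points land next to training points of their own class, so the estimated accuracy is 1.0; real evaluations use far larger, independent test sets.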
1.3.3 Regression Analysis

In statistics and machine learning, regression analysis is a procedure for assessing the relationships among variables. It includes several methods for modeling and analyzing multiple variables, where the goal is to examine the association between a dependent variable and one or more independent variables. Specifically, regression analysis can identify which of the independent variables are closely associated with the dependent variable, and discover the forms of these relationships. It is used for numeric prediction and forecasting, and it tries to determine a function that represents the data with the least possible error. Regression is a predictive data mining task.
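The idea of fitting a function with the least possible error can be made concrete with simple linear regression: fit y = a·x + b by least squares and use the fitted line for numeric prediction. The data points below are illustrative assumptions.

```python
# A minimal sketch of simple linear regression by least squares.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly y = 2x; illustrative values

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares estimates for slope a and intercept b.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

def predict(x):
    return a * x + b

print(f"y = {a:.2f}*x + {b:.2f}; prediction at x=6: {predict(6.0):.2f}")
```

The closed-form solution minimizes the sum of squared residuals; for this data it yields a slope of about 1.99 and an intercept of about 0.09, so the model forecasts roughly 12.03 at x = 6.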
1.3.4 Clustering

Unlike classification, clustering examines data objects without referring to a known class label. Class labels are absent from the training data set because they are not known initially; clustering analysis can be employed to create such labels. Clustering is therefore a form of unsupervised learning and can serve as a preprocessing step for classification. Clustering groups objects by maximizing the intra-cluster similarity and minimizing the inter-cluster similarity: clusters are formed so that objects within the same cluster are highly similar to one another but very different from the objects in other clusters. Every cluster generated is a group of objects and can lead to the formation of new rules. Clustering is a descriptive data mining task. A comparison between clustering and classification is presented in Table 1 below.
Table 1: Comparison of Clustering and Classification
| Clustering | Classification |
| --- | --- |
| It is a descriptive data mining technique. | It is a predictive data mining technique. |
| It is unsupervised learning. | It is supervised learning. |
| The class label attribute is not present. | The class label attribute is present. |
| Examples of clustering techniques are K-Means, K-Medoids, etc. | Examples of classifiers are Multi-layer Perceptron, Support Vector Machine, etc. |
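The K-Means technique named in the table can be sketched directly: alternate between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The points and starting centroids below are illustrative assumptions.

```python
# A minimal sketch of k-means clustering (unsupervised: no class labels).
def kmeans(points, centroids, iters=10):
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                + (p[1] - centroids[i][1]) ** 2)
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8),   # one dense group
          (8.0, 8.0), (8.5, 8.2), (7.8, 7.9)]   # another dense group
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (9.0, 9.0)])
print(centroids)
```

With these starting centroids the algorithm converges immediately to two clusters of three points each, illustrating high intra-cluster similarity and low inter-cluster similarity; in practice k-means is sensitive to the initial centroids and is usually restarted several times.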
1.3.5 Data Summarization

Data very often relates to classes or concepts, and it can be beneficial to describe particular classes or concepts in summarized, concise, and yet precise terms. Such a description of a class or a concept is called a class description or concept description. Data summarization distills the general features of the data in a database. The data corresponding to a user-specified class is usually collected by executing queries on the database. For instance, to investigate the characteristics of electronic products whose sales increased by 20% in the year 2015, the data associated with such products can be gathered by executing an SQL query. Summarization is a descriptive data mining task.
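The chapter's example of collecting class-relevant data with an SQL query can be sketched using Python's built-in sqlite3 module. The table name, columns, and rows below are hypothetical, invented purely to mirror the "electronics sales up 20% in 2015" example.

```python
import sqlite3

# Hypothetical sales table (illustrative schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    product TEXT, category TEXT, sales_2014 REAL, sales_2015 REAL)""")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("laptop",  "electronics", 100.0, 130.0),
     ("camera",  "electronics", 100.0, 105.0),
     ("blender", "kitchen",     100.0, 150.0)])

# Collect the data conforming to the user-specified class: electronic
# products whose 2015 sales increased by at least 20%.
rows = conn.execute("""
    SELECT product, 100.0 * (sales_2015 - sales_2014) / sales_2014 AS growth
    FROM sales
    WHERE category = 'electronics'
      AND sales_2015 >= 1.2 * sales_2014
""").fetchall()
print(rows)   # only the laptop (30% growth) satisfies both conditions
conn.close()
```

The query restricts both on the class (category) and on the growth condition; a summarization step would then characterize the retrieved records in concise terms.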
1.3.6 Outlier Detection
A database may contain items or data objects that do not conform to the general behavior of the data; these are considered outliers or exceptions, and their analysis is called outlier detection or outlier analysis. In some applications, such as fraud detection, outlier analysis may expose fraudulent usage of credit cards, for example by identifying purchases of unusually large amounts for a given account compared with the account holder's usual transactions. Statistical tests based on probability distribution models, as well as various forms of clustering, can identify noise or outliers efficiently. Outlier detection is a predictive data mining task.
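A simple statistical test of the kind mentioned above flags values that lie far from the mean in units of the standard deviation (a z-score test). The transaction amounts and the threshold of two standard deviations are illustrative assumptions; real fraud detectors use per-account models and more robust statistics.

```python
# A minimal sketch of statistical outlier detection on toy transaction
# amounts: flag values more than 2 standard deviations from the mean.
amounts = [52.0, 48.0, 50.5, 49.0, 51.0, 50.0, 49.5, 950.0]

mean = sum(amounts) / len(amounts)
std = (sum((a - mean) ** 2 for a in amounts) / len(amounts)) ** 0.5

outliers = [a for a in amounts if abs(a - mean) / std > 2]
print(outliers)   # the 950.0 purchase stands out from the usual ~50 range
```

Note that a single extreme value also inflates the mean and standard deviation it is judged against (so with very small samples the z-score of even a huge outlier is bounded); robust alternatives such as the median absolute deviation avoid this masking effect.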