Introduction to DM and ML

Data mining is essentially the process of extracting interesting information from large amounts of data and transforming it into meaningful knowledge. Given this definition, the term “data mining” is something of a misnomer. A more accurate name would be “knowledge mining from data”, though that is somewhat lengthy. Nevertheless, the shorter term, which combines “data” and “mining”, became the popular choice in the research community. Several other terms in the literature carry a similar or only slightly different meaning, for example, knowledge extraction, data/pattern analysis, and data archeology.

Data mining is one of the phases of the Knowledge Discovery in Databases process, or KDD process for short. The two terms are not interchangeable: the KDD process denotes the overall procedure for deriving worthwhile knowledge from large amounts of data, while data mining refers to the step that extracts interesting patterns from the data through analysis. Figure 1 shows data mining as a phase in the KDD process.

Figure 1: Data mining as a phase in the KDD process

The knowledge discovery in databases process consists of an iterative sequence of the following steps, as shown in Figure 1:

1. Data selection: This is an essential step in which the data relevant to the investigation task is assembled from different data sources. The data sources may include data warehouses, online transaction records, relational databases, flat files, spreadsheets, or other kinds of information repositories. The resulting data is the target data set.

2. Data preprocessing: This denotes several preprocessing tasks applied to the target data set to ensure consistency in naming conventions, encoding structures, and attribute measures. Preprocessing includes data integration and data cleaning. Data integration combines data from multiple sources where necessary. The data set is then cleaned, where “cleaning” denotes processing the data to reduce noise and to handle missing values.

3. Data transformation: This step is applied to the preprocessed data set prior to data mining. For example, it is used to normalize the data set, because neural-network and regression-based techniques rely on distance measurements for analysis and are therefore sensitive to the scale of attribute values. Normalization maps attribute values to a small range such as [-1.0, +1.0] or [0.0, 1.0]. Occasionally, researchers also use aggregation or consolidation approaches for data transformation.

4. Data mining: Data mining is the step of discovering interesting patterns and knowledge from large amounts of data. It may consult a knowledge base, a repository of information about a particular domain that helps guide the search for interesting patterns.

5. Pattern evaluation: This denotes the process of identifying the truly interesting knowledge based on thresholds or interestingness measures. Finally, the extracted knowledge is presented to the user through visualization techniques.
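
As a rough illustration of steps 2 and 3, the sketch below fills in missing values and then rescales an attribute to [0.0, 1.0] and to [-1.0, +1.0]. It is only one possible approach: mean imputation and min-max scaling are assumed here as the cleaning and normalization strategies, and the function names fill_missing and min_max_scale, as well as the sample ages list, are hypothetical and chosen purely for illustration.

```python
from statistics import mean


def fill_missing(values):
    """Replace None entries with the mean of the observed values.

    Mean imputation is an assumed cleaning strategy, not the only one.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values to the range [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant attribute: map every value to the midpoint of the range.
        return [(low + high) / 2.0 for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


# Hypothetical attribute with one missing value.
ages = [20, None, 40, 60]

cleaned = fill_missing(ages)                # missing entry becomes the mean, 40
unit = min_max_scale(cleaned)               # scaled to [0.0, 1.0]
signed = min_max_scale(cleaned, -1.0, 1.0)  # scaled to [-1.0, +1.0]

print(cleaned)
print(unit)
print(signed)
```

In practice, the choice of imputation and scaling method depends on the data and on the mining technique that follows; this sketch only shows the mechanics.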

Steps 1 to 3 are different forms of data preparation, in which the data is readied for mining. Although data mining is only one step in the KDD process, the term “data mining” is becoming the standard name for the whole process, in preference to the longer expression “knowledge discovery in databases”.

The data mining step may involve machine learning (ML) techniques for knowledge discovery. Several machine learning techniques, such as regression, classification, and clustering, are considered integral parts of data mining.
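
To make one of these techniques concrete, the following minimal sketch fits a straight line to a handful of points using ordinary least squares, the simplest form of regression. The function name fit_line and the sample data are hypothetical, and the example is an assumption about how such a step might look rather than a prescribed method.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations in x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The learned line is the "pattern" discovered from the data; classification and clustering play the same role for categorical labels and for unlabeled groupings, respectively.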