Before data can be used to design a neural network, four data-preparation steps may be applied.
Figure 3.1: Data preparation for neural network design.
- Raw data is first collected.
- In the data-processing step, the data-set can be cleaned by removing corrupted and incorrect records. Transformation techniques may be used to extract useful features or to reduce the data's dimensionality. Categorical variables in the data-set are also converted to numerical values.
- Data labeling may be applied to assign targets to the samples.
- The data-set is then divided into three sets: the training set is used to train the neural network, the validation set is used to prevent over-fitting, and the test set is used to evaluate how well the trained network copes with completely new data.
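The cleaning and splitting steps above can be sketched in a few lines of Python. This is a minimal illustration, not ANNHUB's internal procedure; the 70/15/15 split ratios and the "drop any row with a missing value" cleaning rule are assumptions chosen for the example.

```python
import random

def prepare(records, ratios=(0.7, 0.15, 0.15), seed=0):
    """Clean a list of record dicts, then split into train/val/test.

    Rows containing a missing (None) value are treated as corrupted
    and removed; the remainder is shuffled and split by `ratios`.
    """
    clean = [r for r in records if None not in r.values()]
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    rng.shuffle(clean)
    n_train = int(ratios[0] * len(clean))
    n_val = int(ratios[1] * len(clean))
    train = clean[:n_train]
    val = clean[n_train:n_train + n_val]
    test = clean[n_train + n_val:]
    return train, val, test
```

In practice the validation set is held out during weight updates and monitored to stop training before over-fitting sets in, while the test set is touched only once, for the final evaluation.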
Figure 3.2: Data-set format
The data-set is presented in comma-separated values (CSV) format. The first row is the header row, which contains the names of the inputs and targets. The keywords (output, target, class) must be used in a column name to distinguish targets from inputs.
Each subsequent row represents a data sample, or observation.
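Separating target columns from input columns by the header keywords can be sketched as follows. The exact matching rule (a case-insensitive substring test against the three keywords) is an assumption for illustration; ANNHUB's own parser may differ in detail.

```python
TARGET_KEYWORDS = ("output", "target", "class")

def split_header(header):
    """Split a CSV header row into (input_names, target_names).

    A column is treated as a target if any of the reserved keywords
    appears (case-insensitively) in its name; all other columns are inputs.
    """
    targets = [h for h in header
               if any(k in h.lower() for k in TARGET_KEYWORDS)]
    inputs = [h for h in header if h not in targets]
    return inputs, targets
```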
In Figure 3.2, the IRIS data-set has 4 inputs (sepal length, sepal width, petal length, and petal width) and 3 targets (setosa, versicolour, and virginica). The target is labeled by a combination of three binary states: 100 (setosa), 010 (versicolour), and 001 (virginica). The sample/observation shown in row 5 indicates that if sepal length = 4.6, sepal width = 3.1, petal length = 1.5, and petal width = 0.2, then the target is setosa (100).
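This labeling scheme is one-hot encoding: each class name maps to a binary vector with a single 1. A minimal sketch, using the three IRIS class names from the figure:

```python
CLASSES = ("setosa", "versicolour", "virginica")

def one_hot(label):
    """Encode a class name as its binary target vector,
    e.g. "setosa" -> (1, 0, 0), matching the 100/010/001 scheme."""
    if label not in CLASSES:
        raise ValueError(f"unknown class: {label}")
    return tuple(1 if c == label else 0 for c in CLASSES)
```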
Loading a data-set into ANNHUB
Figure 3.3: Loading a data-set into the ANNHUB environment.
Once the data-set is ready, it can be loaded into ANNHUB by browsing to the data path, as shown in Figure 3.3. In this step, an overview of the data-set content is displayed in the training data view. Most importantly, based on the data-set structure, ANNHUB selects the correct type of neural network and gives recommendations on how to configure it, including pre- and post-processing methods; the number of input, hidden, and output nodes; activation function types; the error function type; the training data ratio; and training algorithms.
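To make the idea concrete, here is a hypothetical sketch of the kind of recommendation that can be derived purely from the data-set structure. The hidden-node heuristic and the activation/error-function choices below are illustrative assumptions, not ANNHUB's actual rules.

```python
def recommend_structure(n_inputs, n_targets):
    """Sketch a network configuration from data-set structure alone.

    The hidden-layer size uses a common rule of thumb (mean of input
    and output counts); this heuristic is an assumption for the example.
    """
    hidden = max(2, round((n_inputs + n_targets) / 2))
    return {
        "input_nodes": n_inputs,
        "hidden_nodes": hidden,
        "output_nodes": n_targets,
        "output_activation": "softmax" if n_targets > 1 else "sigmoid",
        "error_function": "cross-entropy" if n_targets > 1 else "mse",
    }
```

For the IRIS data-set of Figure 3.2 (4 inputs, 3 targets), this sketch would suggest 4 input nodes and 3 output nodes, with the softmax/cross-entropy pairing typical of multi-class classification.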