EnginSoft - A modeFRONTIER Application: Data Mining the Italian Municipalities
EnginSoft
22-23 October 2012 Pacengo del Garda
(VR) - Italy




A modeFRONTIER Application: Data Mining the Italian Municipalities


Nowadays we are frequently bombarded with large quantities of data describing aspects of our world or work in great detail. A very common problem is that there is simply too much data for us to extract sensible meaning or patterns from it. A new discipline, data mining, is rapidly developing to describe the process of extracting meaningful patterns from these complex data sets. The purpose of this paper is to show that modeFRONTIER can be considered a powerful data mining tool, well-equipped to perform a wide range of important data mining tasks.
What makes good data mining software? The most basic requirement is easy access to the data. This implies a wide range of flexible import facilities for a variety of formats. Once the data is accessible, it is important to have flexible and adaptable analytical techniques, including powerful numerical methods. A rich repertoire of charts and tables is certainly necessary for visualizing the results, together with good mechanisms for exporting the conclusions to other software for presentation and documentation.
This paper will illustrate how modeFRONTIER meets all these requirements. Through a comprehensive example, it will show how modeFRONTIER offers a delicious recipe based on the mixing of these important ingredients.


Figure 1: One step of the Data Wizard, the tool for importing data into modeFRONTIER


Figure 2: The database view as it appears in modeFRONTIER

 


Database description
The database presented in this article comprises the most relevant characteristics of all Italian municipalities in terms of population, latitude, longitude, age, average income etc. This database was created by extracting the data from the website
“http://www.comuni-italiani.it/”
In fact, the database was established using an automatic mining system that identifies and extracts regularly structured data from web pages. The database collects all the Italian municipalities with each row containing 12 specific elements:

  • The name of the municipality
  • A unique identification number for the municipality
  • The total surface area of the municipality in square kilometres
  • The latitude and the longitude of the town hall in decimal degrees
  • The elevation in meters above mean sea level of the town hall
  • The total number of residents in 2005 (males and females)
  • The density defined as the ratio between the total population and the surface area
  • The population ageing index, defined as the ratio between residents older than 65 and residents younger than 14
  • The total number of households
  • The average incomes (in Euros) declared in 2005 by the people living in the municipality
  • The total number of houses
  • The heating degree days (HDD), a quantitative index that reflects the demand for energy to heat a home or business. HDD is reported in [KD/a], Kelvin days per year.

Obviously, the database could be much larger, but to focus on this example, we limit our table to the mentioned data. We will learn that we can extract interesting deductions, even by analyzing this reduced data set.
The first step in data mining is to input the raw data in a convenient way. In modeFRONTIER, loading and filtering data is easy: it can be done from formatted files, Excel tables or relational databases such as MySQL and Oracle. During the import phase, the user may remove rows and columns containing useless data, specify the role of each column, insert objectives and constraints if any, and set up the visualization format for numbers. Moreover, thanks to modeFRONTIER’s worktable capabilities, it is also possible to insert additional columns containing derived data.
Once the data is well organized in a table, the data mining software should help the users in identifying important patterns that are not visible at first glance due to the quantity of data and/or the high dimensionality of the problem.


As an example of derived data, we can introduce the average number of people per household in Italy. This value is obtained quite easily by introducing the transfer variable averageHousehold = (male + female) / households, as shown in Figure 3.

Figure 3: Derived Data may be easily introduced in the database view



When a derived column is introduced, it is possible to make queries on this new information. For example, it is now very easy to determine the Italian cities with the highest and lowest household size; these turn out to be Foggia and Trieste respectively.
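The same derived column and query can be sketched outside modeFRONTIER. The fragment below is only an illustration: the three rows and their values are invented placeholders, not figures from the database, and the field names simply mirror the formula given above.

```python
# Hypothetical rows; the numbers are placeholders, not real census values.
rows = [
    {"name": "CityA", "male": 48_000, "female": 52_000, "households": 40_000},
    {"name": "CityB", "male": 70_000, "female": 74_000, "households": 72_000},
    {"name": "CityC", "male": 20_000, "female": 22_000, "households": 14_000},
]


def add_average_household(rows):
    """Return copies of the rows with averageHousehold = (male + female) / households."""
    return [
        {**r, "averageHousehold": (r["male"] + r["female"]) / r["households"]}
        for r in rows
    ]


derived = add_average_household(rows)

# Query the derived column for the extremes.
largest = max(derived, key=lambda r: r["averageHousehold"])
smallest = min(derived, key=lambda r: r["averageHousehold"])
print(largest["name"], smallest["name"])
```

Keeping the derived value as an extra field on each row means that every later query or chart can treat it exactly like an imported column.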


Analyzing the data
We can start analyzing the database in modeFRONTIER with the available charts, beginning with the simplest one, the history plot. Figure 4 reports the history plot of the average incomes of all Italian cities; the red lines mark the income gap between the highest earners (Milano, with 30,973 Euros) and the lowest (Sanluri, with 15,589 Euros). Similarly, we can plot a multi-history chart that shows the number of houses and the number of households in the same plot. Now we may ask how many houses are occupied and how many are vacant or second homes (for example, holiday accommodation).


Figure 4: Pointing out the gap between highest and lowest earners in the big cities

 


To answer this question, we introduce an additional column that gives the number of vacant houses as a fraction of the total number of houses in the territory.
vacantHouses=(houses-households)/houses


Figure 5 shows on the left a multi-history chart with houses and households, and on the right the distribution of vacant houses in Italy as a box-whiskers plot. The multi-history chart shows a strong relationship between the number of houses and the number of families in the territory, and makes it clear that Roma is the most populated city. The box-whiskers plot visualizes the distribution of the data effectively, summarizing the mean and its confidence interval, the quartiles and the outliers. Outliers are the designs that fall outside an interval centered on the mean with a half-width of 1.5 times the standard deviation. Our box-whiskers chart highlights the situation of Agrigento, a city with 43% of vacant houses - a real outlier.
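The vacancy ratio and the outlier rule described above can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the (houses, households) pairs are invented placeholders, and the population standard deviation is used because the text does not say which estimator the chart applies.

```python
from statistics import mean, pstdev


def vacant_ratio(houses, households):
    """vacantHouses = (houses - households) / houses."""
    return (houses - households) / houses


def outliers(values, k=1.5):
    """Indices of values outside the interval [mean - k*std, mean + k*std]."""
    m = mean(values)
    s = pstdev(values)  # assumption: population standard deviation
    return [i for i, v in enumerate(values) if abs(v - m) > k * s]


# Hypothetical (houses, households) pairs; placeholders only.
pairs = [(1000, 900), (2000, 1820), (1500, 1380), (1200, 1080), (1000, 600)]
ratios = [vacant_ratio(h, f) for h, f in pairs]
print(outliers(ratios))
```

With these placeholder values only the last record, where 40% of the houses are vacant, falls outside the interval.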


Figure 5: On the left, a multi-history chart with houses and households; on the right, the distribution of the vacant houses in Italy represented with a box-whiskers plot.

 


How can we easily identify all the other cities with many vacant houses and check where they are located? This is accomplished by clicking on all the outlier points in the box-whiskers chart and plotting a 2-D scatter chart using the longitude and the latitude. In Figure 6 (left), we can see that the cities with the highest share of vacant houses are located near the coasts, which is probably explained by the higher proportion of holiday homes in those areas.
Figure 6 (right) is a bubble chart in which the color of the bubbles indicates the value of the incomes. The bubbles provide a way of displaying a third variable in a two-dimensional chart. The bubble chart shows, at first glance, that there are income gaps between the average values of the North and those of the South of Italy.
By means of the parallel coordinate chart, it is possible to create filters on the data.


Figure 6: On the left, a 2D scatter chart highlighting the cities with many vacant houses; on the right, a bubble chart with the color given by the incomes

 


Figure 7 shows two examples of filters created on the database view. The filter on the left looks for cities with high density and low incomes; the selected line represents the city of Napoli. The chart on the right creates the opposite situation: the blue lines define a filter that selects cities with high incomes and low density. The two lines surviving the filter are Parma and Siena, which therefore share the characteristics of wealth and spaciousness.
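In plain code, such a filter amounts to keeping the rows whose values fall inside a range on each selected axis. The sketch below is illustrative only: the rows, the column names `density` and `income`, and the threshold values are all hypothetical and do not come from the database.

```python
from math import inf


def filter_rows(rows, **bounds):
    """Keep the rows whose value for every named column lies within (low, high)."""
    return [
        r for r in rows
        if all(low <= r[col] <= high for col, (low, high) in bounds.items())
    ]


# Hypothetical rows: density in inhabitants per square kilometre, income in Euros.
rows = [
    {"name": "CityA", "density": 8000, "income": 16000},
    {"name": "CityB", "density": 300, "income": 24000},
    {"name": "CityC", "density": 2500, "income": 19000},
]

# High density and low incomes (left chart), then the opposite (right chart).
dense_low_income = filter_rows(rows, density=(5000, inf), income=(0, 17000))
spacious_rich = filter_rows(rows, density=(0, 1000), income=(22000, inf))
print([r["name"] for r in dense_low_income], [r["name"] for r in spacious_rich])
```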


Figure 7: Creating filters on the data. On the left, the red filters are looking for cities with high density and low incomes. On the right, the opposite situation is created, with a filter for selecting cities with high incomes and low density

 


The correlation matrix and the scatter matrix are other useful tools available in modeFRONTIER for checking whether there is any relation between the data. The correlation coefficient is a measure of the closeness of the linear relationship between two variables. It is a pure number that ranges from -1.00 to +1.00: a value of -1.00 represents a perfect negative correlation, while a value of +1.00 represents a perfect positive correlation. Positive values indicate a tendency of the two variables to increase together. When the coefficient is negative, on the contrary, large values of the first variable are associated with small values of the second one. In the scatter matrix, the scatter plots together with the regression lines help to visualize the data and understand how they are distributed in the space. All the graphs can be enlarged, printed and explored.
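For readers who want to check a single entry of such a matrix by hand, the coefficient described above can be computed as follows. This is a minimal sketch of the standard Pearson formula, which is assumed here to be the coefficient meant in the text; the input lists are arbitrary illustrative numbers.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Two variables that increase together, and two that move in opposite directions.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3], [3, 2, 1]))
```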


In this example (Figure 8), it can be seen that the most important negative correlation involves the ageing index and the average household size (-0.820). This can be explained by the fact that, as the age of the population increases, single-person households become more common (widows and widowers, for example). Apart from some expected relations between the number of people and the number of houses, one of the most important positive correlations involves the heating degree days (HDD) and the latitude (0.805): this means that the heating requirements in the northern parts of Italy are much higher than in the South.
So far, the charts have been created only taking into account the biggest cities and not all the Italian municipalities. Now, we are going to import the complete list of Italian municipalities (more than 8,000) and compare the values between municipalities and big cities.


Figure 8: Scatter matrix of the data. This single chart combines the correlation matrix with all the 2D scatter charts and their regression lines.

 


For example, it is very interesting to compare the incomes of big cities and rural areas. This is possible in modeFRONTIER because we can maintain more than one view in the same project and compare and combine the tables. Figure 9 shows how the distribution of incomes in the big cities differs from the distribution across all Italian municipalities. Apart from Basiglio, where people are so rich that the town can be considered “out of the range”, the average incomes in the cities are usually much higher than the average incomes in the countryside. To be precise, the average income is 21,833 Euros in the cities and 17,320 Euros across all Italian municipalities, a difference of more than 4,000 Euros per year.
Using statistical charts to evaluate the incomes, we discover that 90% of Italians earn less than 21,000 Euros per year (Figure 11).
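The 90% figure is read off a cumulative distribution function, whose empirical form is simple to compute. The sketch below uses ten invented income values as placeholders and treats each value as one unweighted record, which is an assumption about how the chart is built.

```python
def empirical_cdf(values, threshold):
    """Fraction of the values that are less than or equal to the threshold."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(v <= threshold for v in values) / len(values)


# Hypothetical average incomes in Euros; placeholders only.
incomes = [14000, 15500, 16000, 17000, 17500, 18000, 19000, 20000, 20500, 26000]
print(empirical_cdf(incomes, 21000))
```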


Figure 9: The box-whiskers plots comparing the incomes between big cities and countryside.




Figure 10: A very nice view of Italy, a bubble chart with longitude, latitude and elevation



Figure 11: The cumulative distribution function chart showing that only 10% of the Italian population declares more than 21,000 Euros per year

 


Multivariate Analysis
Multivariate Data Analysis (MVA) refers to any statistical technique used to analyze data that arises from more than one variable. Traditional tools allow only a partial representation of a multivariate database, hence the user may not have a global view of the dataset and therefore, data mining could be extremely difficult and unfruitful.
MVA can be used to:

  1. Summarize complex data tables.
  2. Analyze groups in data, to inspect how these groups differ and to assign each instance to a group.
  3. Find relationships between the variables and how some variables can characterize different groups.
modeFRONTIER has a multivariate analysis environment which contains the following tools: principal component analysis (PCA), multi-dimensional scaling (MDS), self organizing maps (SOM), partitive clustering (K-means), and hierarchical clustering. When the available information is stored in huge tables containing many rows and several columns, this environment (see Figure 12) can be used to process the information in a meaningful fashion.



Figure 12: Multivariate analysis panel: from here the user can access all the data analysis tools available in modeFRONTIER



Multidimensional scaling
We can now start using this multivariate analysis environment to analyze the data of the Italian municipalities with, for example, the multi-dimensional scaling (MDS) tool. The goal of MDS (Cox & Cox, 1994; Borg & Groenen, 1997) is to produce a low-dimensional representation of multivariate samples such that the distances between the points in the lower dimensional space best match those in the original space. This way, a one-to-one correspondence between data and projections is achieved, usually in a 2-dimensional space.
In this example, we project into a 2-dimensional space the 3-dimensional problem of the Italian cities with their longitude, latitude and incomes. When an MDS is run on the data, a corresponding location in the 2-dimensional chart is determined for each city so as to preserve the interpoint distances of the data as much as possible. What happens to our data is really interesting. Figure 13 shows the difference between a longitude-latitude plot of Italy and a projection of Italy according to the incomes. The colors in the chart are given by the regions and are very helpful in verifying that the structure of the country and its regions is more or less maintained in the projection; we may say that the incomes are quite uniform within the regions. It is interesting to note that the Northern part of Italy is a bit stretched in the MDS projection due to the positive economic situation of Lombardia, which is represented by purple points. In the MDS chart, it can also be noted that the only city which moves away from all the other cities of its region is Roma. This testifies to the fact that Roma is a rich city in the middle of a less favorable region.
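One common way to make "best match" precise is a least-squares stress criterion. The formula below is a textbook form given as an assumption; the article does not state which criterion modeFRONTIER minimizes. Here \delta_{ij} denotes the distance between cities i and j in the original (longitude, latitude, incomes) space, y_i is the 2-dimensional projection of city i, and n is the number of cities.

```latex
\sigma(y_1, \dots, y_n) = \sum_{i < j} \bigl( \delta_{ij} - \lVert y_i - y_j \rVert \bigr)^2
```

A projection with a small value of \sigma keeps cities that are close in the original space close in the chart, which is why a city whose incomes differ strongly from those of its neighbours is pushed away from them.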


Figure 13: On the left, a scatter plot of the longitude and latitude of Italian cities with colors given by the different regions. On the right, the scatter plot represents a projection of the cities given the incomes as a third dimension

 


Principal Component Analysis
Principal component analysis (PCA) is a method which finds projections of maximal variability: it searches for the linear combinations of the original variables along which the data vary most. It can be used for multidimensional data visualization and for multivariate correlation analysis.
If we run a PCA on the Italian municipalities database, we discover that we could reduce our database to five dimensions while preserving more than 90% of the information in the data (see Figure 14): all the other dimensions may be considered linear combinations of these five. The projection indicates that there are several variables sharing a similar trend (e.g. male, female, density, households) and that some of them may be neglected in further inspections.
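The number of dimensions to retain is read from the cumulative weight of the eigenvalues. The following sketch shows that calculation only; the eigenvalues are invented placeholders, not the values plotted in Figure 14, and the 90% target is passed in as a parameter.

```python
def components_needed(eigenvalues, target=0.90):
    """Smallest number of leading components whose cumulative weight reaches target.

    Returns (number_of_components, cumulative_weight).
    """
    total = sum(eigenvalues)
    cumulative = 0.0
    ordered = sorted(eigenvalues, reverse=True)
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k, cumulative / total
    return len(ordered), 1.0


# Hypothetical eigenvalues; placeholders only.
eigenvalues = [4.0, 2.0, 1.5, 1.0, 0.75, 0.5, 0.25]
k, weight = components_needed(eigenvalues)
print(k, weight)
```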



Figure 14: On the left, eigenvalues and cumulative weight of the PCA on the municipalities database. On the right, the projections and the contribution of the original variables.

 


Clustering
The aim of clustering is to find patterns in data. A cluster is a subgroup of data which are “similar” to each other and “dissimilar” to those belonging to other clusters. Generally, the similarity criterion is given by the Euclidean distance, but many other measures can be considered. Clustering methods can be divided into two types: hierarchical and partitional clustering (K-means).
In this section, we will perform the hierarchical clustering of the Italian municipalities, although we should note that other interesting results could have been obtained by the application of different methods.
Hierarchical clustering proceeds by iteratively merging smaller clusters into larger ones. The result of a hierarchical clustering algorithm is a tree of clusters called a dendrogram, which shows how the clusters are related. By cutting the dendrogram at the desired level, a clustering of the data into disjoint groups is obtained. For example, in our case, running a hierarchical clustering on all the data except the latitude, the longitude and the identifier (to avoid any bias in the results), we obtain the dendrogram in Figure 15, which proposes a subdivision of the cities into three groups.
If we apply the proposed clustering to the data, we can plot the chart in Figure 15 (right) and we can, once more, verify that even if the latitude and the longitude are not taken into account, all the other values (the households, incomes, HDD, inhabitants, …) can be sufficient to recognize some “dissimilarities” between the North, Center and South of Italy.
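The merging procedure can be sketched in a few lines. This is an illustration under stated assumptions, not the modeFRONTIER implementation: it uses the Euclidean distance and average linkage (the article does not name the linkage rule), stops as soon as the requested number of clusters remains, which corresponds to cutting the dendrogram at that level, and runs on invented 2-dimensional points. With real columns in different units, the variables would usually be rescaled first.

```python
from math import dist


def agglomerate(points, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain.

    Returns a list of clusters, each a list of indices into points.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average Euclidean distance between all pairs drawn from the two clusters.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        p, q = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]  # q > p, so index p is unaffected
    return clusters


# Hypothetical points forming three well separated groups.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
groups = agglomerate(points, 3)
print(groups)
```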

Figure 15: On the left, a dendrogram of the database; 3 clusters are proposed as a good clustering. On the right, the clustered Italy




Curiosities
Without creating any charts, we can discover some of the extremes in Italy by selecting the maximum and minimum of the columns for cities and municipalities. For example, we can easily see that the most elevated town in Italy is Sestriere (in the province of Torino), the highest city is Enna (931 m), and the lowest city is, not surprisingly, Venezia (2 m), while Taglio di Po, Comacchio and Lagosanto are at sea level. The highest population density is found in Portici (in the province of Napoli) and the lowest in Briga Alta (in the province of Cuneo); the highest incomes are in Basiglio, and in Milano among the big cities, and the lowest incomes are in Maniace (Catania) and Sanluri. The citizens of Otranto (Lecce) are the first Italians to greet the rising sun, since the town has the highest longitude, while Bardonecchia (Torino) is the westernmost place in the country.


Conclusions
In this article, the authors drew a range of conclusions starting from a database of numbers; in other words, the authors transformed data into knowledge. Many people tend to refuse to believe these types of statistical studies, particularly when the results do not match their previous beliefs. It is well-known folklore that statisticians can obtain any result they wish simply by massaging the data.

In truth this is not correct. The real issue is that data can be examined with an infinite range of techniques. The careful selection of an appropriate technique crucially depends on the question that we are trying to answer. For instance, in our database of Italian municipalities, we concentrated more on the incomes than on the other characteristics.

The same multivariate analysis tools applied to other data, such as the ageing of the households, could have delivered completely different information. Thus we should emphasise that these pages do not represent a basis for a complete description of Italian economics, Italian habits or happiness. The aim of this article is simply to show typical instruments that can be used for data mining and to demonstrate how this can be easily achieved with modeFRONTIER.


References
Borg, I., & Groenen, P. J. F. (1997). Modern Multidimensional Scaling: Theory and Applications. Springer.
Cox, T. F., & Cox, M. A. (1994). Multidimensional Scaling. Chapman & Hall.
Jain, A., Murty, M., & Flynn, P. (1999). Data clustering: a review. ACM Computing Surveys, 31.
Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S. Springer.


For further information:
info@enginsoft.it
