Data Mining

Large organizations tend to have large databases that hold their business data. These databases include information about everything from customers and clients to sales, offices, earnings, products, employees and so on. While an enormous amount of data is stored in databases in an obvious, straightforward fashion, the idea underlying data mining is that a database may hold even more information that isn't obvious at first glance. For example, maybe when the company stock price rises, sales in Los Angeles and New York tend to increase whereas sales in Sacramento and Houston decline. That pattern would not show up in any single record or report, and yet it is very useful information to have. Data mining is the process of getting this kind of information out of the database.

To a certain extent, some businesses have been involved in similar processes for decades, if not centuries. A good example is insurance, where actuaries use statistical principles to determine the likelihood of a given driver having an accident, and that likelihood is used to set the insurance rates for that driver. So in that sense, data mining isn't new; it is just a relatively new name for an old process. However, data mining attempts to go beyond the statistical processes of old and to find patterns in the data which were not even anticipated by the business people using the data mining algorithm.

One reason for this is that large amounts of data are now available which were not necessarily available in the past. Data mining can look not only at a company's own business data but also at data not directly connected with the company to see if there is any connection or correlation. What we are looking for here is simply connections between various items of data. We aren't necessarily looking for cause and effect: data mining usually does not tell us whether one piece of data causes the other.

There is almost no limit to the combinations of data which can be compared to determine whether there is a correlation or pattern. Not only will there be any number of raw data items, but it is possible to compare data from different points in time (for example, sales two days after a change in the stock price). It is also possible to come up with any number of arithmetic combinations of the data to see if that yields any statistical correlations. As with other forms of artificial intelligence like neural networks or time series prediction, it is necessary to avoid overfitting the data, that is, looking so hard for possible correlations that you find correlations that aren't really there.
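The time-shifted comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (pearson, lagged_correlation) and the toy series are invented for the example, and Pearson correlation is just one of many measures a data mining tool might use.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series; report 0
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(driver, response, lag):
    """Correlate driver[t] with response[t + lag] for a non-negative lag."""
    if len(driver) != len(response):
        raise ValueError("series must have the same length")
    if not 0 <= lag < len(driver) - 1:
        raise ValueError("lag must leave at least two overlapping points")
    if lag == 0:
        return pearson(driver, response)
    # Drop the last `lag` driver values and the first `lag` response values
    # so that position t in one list lines up with t + lag in the other.
    return pearson(driver[:-lag], response[lag:])


# Hypothetical toy data: daily stock price change, and sales that were
# constructed to follow the stock change exactly two days later.
stock_change = [1, -2, 3, 0, -1, 2, -3, 1, 0, 2]
sales = [100, 100] + [100 + 5 * c for c in stock_change[:-2]]

for lag in range(4):
    print(lag, round(lagged_correlation(stock_change, sales, lag), 2))
```

Because the toy sales series was built from the stock series with a two-day delay, the correlation at lag 2 stands out from the others. On real data, trying many lags and many data combinations is exactly where the overfitting risk comes from: the more comparisons are tried, the more likely it is that one of them looks strong purely by chance.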

When a data mining algorithm finds some kind of correlation, the next thing which needs to be determined is what to do with it. Data mining algorithms tend to find business rules which can then be encoded (either automatically or manually) in a rule-based system. Two types of rules tend to be found by data mining algorithms. Some rules will be hard-and-fast business rules which may or may not have already been encoded in the system. For example, if the data mining algorithm finds that all homes sold in a particular area of California have earthquake insurance, that may be a requirement which should be encoded as a business rule. Such apparently hard-and-fast rules should be investigated further if they are not already encoded in the system; it may be that certain rules are well known to the business people but have never been formally represented in the system.
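As a rough illustration of how an apparently hard-and-fast rule might be spotted, the sketch below scans a set of records for cases where every record with a given attribute value shares the same value of another attribute. The function name find_exact_rules, the min_support cutoff and the sample home records are all assumptions made for the example; real data mining tools use far more elaborate methods.

```python
from collections import defaultdict


def find_exact_rules(records, min_support=3):
    """Find rules 'attr_a == x  =>  attr_b == y' that hold for every record.

    Assumes each record is a dict with the same keys and hashable values.
    A rule is reported only if at least `min_support` records match its
    condition, so that a single odd record does not count as a rule.
    """
    condition_counts = defaultdict(int)   # (attr, value) -> matching records
    target_values = defaultdict(set)      # (attr, value, target) -> values seen

    for record in records:
        for cond_attr, cond_value in record.items():
            condition_counts[(cond_attr, cond_value)] += 1
            for target_attr, target_value in record.items():
                if target_attr != cond_attr:
                    key = (cond_attr, cond_value, target_attr)
                    target_values[key].add(target_value)

    rules = []
    for (cond_attr, cond_value, target_attr), values in target_values.items():
        support = condition_counts[(cond_attr, cond_value)]
        if len(values) == 1 and support >= min_support:
            only_value = next(iter(values))
            rules.append((cond_attr, cond_value, target_attr, only_value, support))
    return sorted(rules)


# Hypothetical home sales records
homes = [
    {"region": "Bay Area", "earthquake_insurance": True},
    {"region": "Bay Area", "earthquake_insurance": True},
    {"region": "Bay Area", "earthquake_insurance": True},
    {"region": "Central Valley", "earthquake_insurance": True},
    {"region": "Central Valley", "earthquake_insurance": False},
    {"region": "Central Valley", "earthquake_insurance": False},
]

for rule in find_exact_rules(homes):
    print(rule)
```

A rule found this way is only a candidate. As the text says, it still has to be investigated to see whether it reflects a real business requirement or is merely an accident of the data collected so far.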

The other type of business rule, which is the main thrust of data mining algorithms, is not a hard-and-fast business rule but a statistical rule. The following might be an example:

     IF   a competitor's stock price on day n drops more than 10%
     THEN sales will, on average, rise 8% on day n+2.
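A statistical rule of this shape could be estimated from historical data roughly as follows. This is a minimal sketch, not a description of any particular data mining product: the function name average_lagged_effect, its parameters and the sample figures are hypothetical, and both series are assumed to hold daily percentage changes.

```python
from statistics import mean


def average_lagged_effect(trigger_series, outcome_series, threshold=-10.0, lag=2):
    """Average outcome `lag` days after the trigger falls below `threshold`.

    Returns (average, number_of_trigger_days). The average is None when the
    trigger never fires, since no estimate can be made in that case.
    """
    outcomes = [
        outcome_series[day + lag]
        for day, change in enumerate(trigger_series)
        if change < threshold and day + lag < len(outcome_series)
    ]
    if not outcomes:
        return None, 0
    return mean(outcomes), len(outcomes)


# Hypothetical daily percentage changes
competitor_price_change = [-12.0, 1.0, 0.0, -15.0, 2.0, 3.0, -11.0, 0.0, 1.0]
sales_change = [0.0, 1.0, 7.0, -1.0, 0.0, 9.0, 1.0, 0.0, 8.0]

effect, occurrences = average_lagged_effect(competitor_price_change, sales_change)
print(effect, occurrences)
```

Returning the number of trigger days along with the average matters: an average based on only a handful of occurrences is weak evidence, which ties back to the earlier warning about finding correlations that aren't really there.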

As noted earlier, these rules will be incorporated into a rule-based system. Usually, rules generated by the data mining algorithm must be combined with other, more stable rules written by a human being in order to get meaningful performance from the system. However, one advantage of the data mining algorithm over having a human produce rules manually is that the algorithm can be run frequently, so there is no need for time-consuming human intervention every time the business rules change.

Data mining algorithms were originally explored in the early to mid-1990s for such disparate industries as tax, finance, retail and insurance. Almost any industry with a good deal of data could potentially use data mining algorithms. There was a limit to how far data mining could be applied at that time because it was cumbersome to find an intelligent way to use the results of data mining algorithms; technology just wasn't operating on the 24x7 paradigm which was needed to make data mining a complete success. More recently, however, data mining has enjoyed a tremendous resurgence in Internet and e-commerce applications, especially to assist in personalization applications. The Internet seems to provide the right environment in terms of speed and responsiveness to make the use of data mining algorithms realistic.

For more information on data mining, you might want to read "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations" or "Data Mining: Concepts and Techniques".




All copyrights are maintained by respective contributors and may not be reused without permission. Graphics and scripts may not be directly linked to. Site assets copyright © 2000 RamaLila.com and respective authors.