Apriori Online – Mining for Associations

Why are associations interesting? For the same reason data mining in general is interesting: you can learn things you would never have thought of. Take Shoppers Drug Mart as an example. Don’t quote me on this, but I bet that when you buy a bunch of goods (and swipe your Shoppers Optimum Card), some database registers your personal information (age, sex, location, etc.) and everything you purchased (Tylenol, Kleenex, gum, etc.).

With high confidence (no Apriori pun intended) I can say that Shoppers Drug Mart periodically mines that data for interesting associations, like perhaps:

- Males between the ages of 25 and 45 purchase toilet paper and Mach 3 razors together between 7:00pm and 12:00am.

What would Shoppers do? In the evenings they would set up an aisle display of expensive toilet paper. At the front of that aisle (where the customer walks to the cashier) they would set up another display of Mach 3 razors…10 packs. They’ll do this because they’ve found an association that tells them, with high confidence, that males will purchase these two items together.

For a data mining project this past semester, one of the deliverables we created was a web-based version of the Apriori algorithm. The algorithm was initially developed by Agrawal et al. (Agrawal 93, Agrawal 94) and is used to find association rules in a dataset. This online version is based on Christian Borgelt’s implementation of Apriori.
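For anyone who wants to see the idea in code, here is a minimal sketch of the level-wise search Apriori performs. It is not Christian Borgelt’s implementation (which is far more optimized), and the names `frequent_itemsets`, `confidence` and `baskets` are made up for this illustration. It assumes the textbook definitions: support is the fraction of transactions containing an itemset, and confidence is support(both sides) divided by support(antecedent).

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    current = {}
    for item in {i for t in transactions for i in t}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    result = {}
    k = 1
    while current:
        result.update(current)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
    return result


def confidence(itemsets, antecedent, consequent):
    """Confidence of antecedent -> consequent, from precomputed supports."""
    a = frozenset(antecedent)
    return itemsets[a | frozenset(consequent)] / itemsets[a]


# Toy baskets, invented for this example.
baskets = [
    {"toilet paper", "razors"},
    {"toilet paper", "razors", "gum"},
    {"toilet paper", "gum"},
    {"razors"},
]
found = frequent_itemsets(baskets, min_support=0.5)
tp_to_razors = confidence(found, {"toilet paper"}, {"razors"})
```

The prune step is what gives Apriori its name: an itemset can only be frequent if all of its subsets are, so most candidates are thrown away before their support is ever counted.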

The URL is http://apriori.sematopia.com. Aside from simply executing Christian’s Apriori implementation, the web-based application performs a series of pre-processing steps on the data. First, all missing values in a given numeric column are filled in with the average of that attribute. Then every numeric attribute is categorized, placing each column value in a bin. Categorization is controlled by the Category Granularity (how many categories to make in a given column) and the Categorization Threshold (if the number of unique values in the column is less than this number, categorization is not performed). Besides these two, many other options can be set to adjust the results Apriori returns, including the max/min support, the min confidence, the min sets per rule, etc.
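The two pre-processing steps can be sketched roughly as follows. This is an illustration, not the site’s actual code: the function names are hypothetical, missing values are assumed to arrive as `None`, and equal-width bins are my assumption here since the binning strategy is not spelled out above.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def categorize(values, granularity, threshold):
    """Map each numeric value to a (bin_number, low, high) tuple.

    granularity: how many bins to make (Category Granularity).
    threshold:   columns with fewer unique values than this are left
                 unchanged (Categorization Threshold).
    Equal-width bins are an assumption made for this sketch.
    """
    low, high = min(values), max(values)
    if len(set(values)) < threshold or low == high:
        return list(values)
    width = (high - low) / granularity
    binned = []
    for v in values:
        # Clamp so the maximum value lands in the last bin, not one past it.
        index = min(int((v - low) / width), granularity - 1)
        binned.append((index + 1, low + index * width, low + (index + 1) * width))
    return binned


# Made-up SAT-style scores with one missing entry.
scores = [400, None, 500, 600, 700]
filled = fill_missing_with_mean(scores)
bins = categorize(filled, granularity=3, threshold=4)
```

With these made-up scores the missing entry becomes 550.0, and the column is split into three bins of width 100 starting at 400.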

The uploaded file must be in CSV format, with the first row as headings and each subsequent row as an observation. The max file size that can be uploaded is 200,000 bytes. After the algorithm finishes (which should take less than 3 seconds) the first 100 association rules will be shown. You can download the full output from the website.
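If you want to check a file before uploading it, something like the sketch below would do. The 200,000 byte limit and the heading row come from the rules above; the function name `read_upload`, the UTF-8 decoding and the check that every row has as many fields as the heading row are my own assumptions, not something the site is known to enforce.

```python
import csv
import io

MAX_BYTES = 200_000  # upload limit stated in the post


def read_upload(raw: bytes):
    """Return (headings, observations) or raise ValueError for a bad file."""
    if len(raw) > MAX_BYTES:
        raise ValueError(f"file is {len(raw)} bytes; the limit is {MAX_BYTES}")
    # UTF-8 is an assumption; the post does not name an encoding.
    rows = list(csv.reader(io.StringIO(raw.decode("utf-8"))))
    if len(rows) < 2:
        raise ValueError("expected a heading row plus at least one observation")
    headings, observations = rows[0], rows[1:]
    for number, row in enumerate(observations, start=2):
        if len(row) != len(headings):
            raise ValueError(
                f"row {number} has {len(row)} fields, expected {len(headings)}"
            )
    return headings, observations


headings, observations = read_upload(b"math_sat,verbal_sat\n500,450\n")
```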

The associations produced will look like this:

verbal_sat=c3_g7(442,469) <- math_sat=c2_g7(500,530) act=c4_g6(21,23) (36.1, 93.0)

c3 means column 3
g7(442,469) means group (bin) 7, where this specific bin covers values between 442 and 469
(36.1, 93.0) at the end of each rule gives the support and confidence
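If you download the full output and want to work with the rules programmatically, a small parser along these lines should handle the format shown above. It is a sketch based only on that one example line: `parse_rule` and the dictionary keys are names I made up, and it assumes attribute names contain only letters, digits and underscores.

```python
import re

# One item looks like: verbal_sat=c3_g7(442,469)
ITEM = re.compile(r"(\w+)=c(\d+)_g(\d+)\(([^,]+),([^)]+)\)")
# The rule ends with: (support, confidence)
STATS = re.compile(r"\(([\d.]+),\s*([\d.]+)\)\s*$")


def parse_items(text):
    """Turn 'name=cN_gM(low,high) ...' into a list of dictionaries."""
    return [
        {
            "attribute": name,
            "column": int(column),
            "group": int(group),
            "low": float(low),
            "high": float(high),
        }
        for name, column, group, low, high in ITEM.findall(text)
    ]


def parse_rule(line):
    """Split a rule into its head, body, support and confidence."""
    stats = STATS.search(line)
    if stats is None:
        raise ValueError("no (support, confidence) pair at the end of the rule")
    # Everything before the trailing pair is 'head <- body'.
    head_text, _, body_text = line[: stats.start()].partition("<-")
    return {
        "head": parse_items(head_text),
        "body": parse_items(body_text),
        "support": float(stats.group(1)),
        "confidence": float(stats.group(2)),
    }


rule = parse_rule(
    "verbal_sat=c3_g7(442,469) <- math_sat=c2_g7(500,530) "
    "act=c4_g6(21,23) (36.1, 93.0)"
)
```

Here `rule["head"]` holds the single consequent (verbal_sat) and `rule["body"]` holds the two conditions on math_sat and act.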

Each IP address will be allowed to run 5 datasets per day.
Good luck, happy mining.
http://apriori.sematopia.com


4 Responses to “Apriori Online – Mining for Associations”

  1. Zohaib says:

    Hey Georgy,
    The project was actually pretty interesting to work on, good work on the site.

  2. Jeff Hagelberg says:

    Is the source code for this available anywhere?

  3. shko says:

    Hello, I am a software engineering student and I would like some ideas about data mining.
    Thanks

  4. George A. Papayiannis says:

    If you’re looking to learn about Data Mining, the best way is to get a popular book and start reading.
    I remember that a great DM book was available free online. Ask around in your department; someone might know about it.

    I’ve taken down the online portion of this software for now. You can use the command line version from Christian Borgelt (read post above).
