<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>George Papayiannis &#187; Data Mining</title>
	<atom:link href="http://www.sematopia.com/category/data-mining/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.sematopia.com</link>
	<description></description>
	<lastBuildDate>Sat, 01 May 2010 04:50:01 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>k-means and EM clustering algorithms</title>
		<link>http://www.sematopia.com/2006/04/k-means-and-em-clustering-algorithms/</link>
		<comments>http://www.sematopia.com/2006/04/k-means-and-em-clustering-algorithms/#comments</comments>
		<pubDate>Sun, 16 Apr 2006 20:32:33 +0000</pubDate>
		<dc:creator>George A. Papayiannis</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[McMaster]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Tutorials]]></category>

		<guid isPermaLink="false">http://www.sematopia.com/?p=72</guid>
		<description><![CDATA[The final data mining assignment (which was due in the middle of exams) was to implement the k-means and EM algorithms in C.  These are two pretty simple clustering algorithms, with k-means (or variants of k-means) used in industry all the time.  Both algorithms work in two dimensions and take an input file of [...]]]></description>
			<content:encoded><![CDATA[<p>The final data mining assignment (which was due in the middle of exams) was to implement the k-means and EM algorithms in C.  These are two pretty simple clustering algorithms, with k-means (or variants of k-means) used in industry all the time.  Both algorithms work in two dimensions and take an input file of data points separated by a space.</p>
<p>K-means <a href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/kmeans.html">works as follows</a>:</p>
<ol>
<li>Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.</li>
<li>Assign each object to the group that has the closest centroid.</li>
<li>When all objects have been assigned, recalculate the positions of the K centroids.</li>
<li>Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.</li>
</ol>
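<p>The steps above can be sketched in a few lines. This is not the C code from the assignment, just a minimal Python illustration; the function name <code>kmeans</code>, the <code>max_iter</code> cap and the handling of empty groups are assumptions made for the example.</p>

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Cluster 2-D points; `centroids` holds the K initial centroid guesses."""
    centroids = [tuple(c) for c in centroids]
    groups = []
    for _ in range(max_iter):
        # Step 2: assign each point to the group with the closest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
            groups[nearest].append(p)
        # Step 3: recalculate each centroid as the mean of its group.
        updated = []
        for k, group in enumerate(groups):
            if group:
                updated.append((sum(x for x, _ in group) / len(group),
                                sum(y for _, y in group) / len(group)))
            else:
                # Assumption: an empty group keeps its previous centroid.
                updated.append(centroids[k])
        # Step 4: stop once the centroids no longer move.
        if updated == centroids:
            break
        centroids = updated
    return centroids, groups
```

<p>Calling <code>kmeans(points, initial_centroids)</code> with a list of (x, y) tuples returns the final centroids and the points assigned to each one. The result depends on the initial centroids, so different starting guesses can give different groupings.</p>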
<p>The EM algorithm is a <a href="http://www.cas.mcmaster.ca/~cs4tf3/">bit more complicated</a>, but has generally the same idea:</p>
<p><img id="image71" alt="em" src="http://www.sematopia.com/wp-content/uploads/2006/04/em.jpg" /></p>
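<p>For readers who cannot see the image, here is a rough Python sketch of EM for a mixture of Gaussians in two dimensions. It is not taken from the assignment code: it assumes spherical components (one variance per cluster), a fixed number of iterations, and initial means supplied by the caller, and the name <code>em_spherical</code> is made up for the example. The soft assignment in the E-step plays the role of the k-means "closest centroid" step, and the M-step plays the role of recalculating the centroids.</p>

```python
import math

def em_spherical(points, means, n_iter=50):
    """Fit K spherical 2-D Gaussians; `means` holds the K initial guesses."""
    means = [tuple(m) for m in means]
    k = len(means)
    variances = [1.0] * k      # assumption: start every component at variance 1
    weights = [1.0 / k] * k    # mixing proportions, initially uniform
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x, y in points:
            dens = []
            for j in range(k):
                d2 = (x - means[j][0]) ** 2 + (y - means[j][1]) ** 2
                dens.append(weights[j] * math.exp(-d2 / (2 * variances[j]))
                            / (2 * math.pi * variances[j]))
            total = sum(dens) or 1e-300   # guard against total underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            mx = sum(r[j] * p[0] for r, p in zip(resp, points)) / nj
            my = sum(r[j] * p[1] for r, p in zip(resp, points)) / nj
            spread = sum(r[j] * ((p[0] - mx) ** 2 + (p[1] - my) ** 2)
                         for r, p in zip(resp, points))
            means[j] = (mx, my)
            variances[j] = max(spread / (2 * nj), 1e-6)  # floor avoids collapse
            weights[j] = nj / len(points)
    return weights, means, variances
```

<p>Unlike k-means, every point contributes to every cluster, weighted by its responsibility, so the cluster boundaries are soft.</p>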
<p>Since this was in the middle of exams, I wrote both solutions (<a href="http://www.sematopia.com/wp-content/uploads/2006/04/kmeans_em.zip">code</a>) in one night (7pm to 3am), so no guarantee whatsoever.  Take a look; if you&#8217;d change anything let me know.  The code is licensed under <a href="http://www.gnu.org/licenses/gpl.txt">GPL v2</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.sematopia.com/2006/04/k-means-and-em-clustering-algorithms/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Apriori Online &#8211; Mining for Associations</title>
		<link>http://www.sematopia.com/2006/04/apriori-online/</link>
		<comments>http://www.sematopia.com/2006/04/apriori-online/#comments</comments>
		<pubDate>Thu, 06 Apr 2006 01:24:40 +0000</pubDate>
		<dc:creator>George A. Papayiannis</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[McMaster]]></category>
		<category><![CDATA[Projects]]></category>

		<guid isPermaLink="false">http://www.sematopia.com/?p=70</guid>
		<description><![CDATA[Why are associations interesting?  For the same reason data mining in general is interesting &#8211; you can learn things you would have never thought of.  Take Shoppers Drug Mart as an example: don&#8217;t quote me on this, but I bet when you buy a bulk of goods (and swipe your Shoppers Optimum Card) [...]]]></description>
			<content:encoded><![CDATA[<p>Why are associations interesting?  For the same reason data mining in general is interesting &#8211; you can learn things you would have never thought of.  Take Shoppers Drug Mart as an example: don&#8217;t quote me on this, but I bet when you buy a bulk of goods (and swipe your Shoppers Optimum Card) some database is registering your personal information (age, sex, location, etc.) and everything you purchased (Tylenol, Kleenex, gum, etc.).</p>
<p>With high confidence (no Apriori pun intended) I can say that periodically Shoppers Drug Mart mines that data for interesting associations, like perhaps:</p>
<p>- Males between the ages of 25 and 45 purchase toilet paper and Mach 3 razors between 7:00pm and 12:00am.</p>
<p>What would Shoppers do?  In the evenings they would set up an aisle display of expensive toilet paper.  At the front of that aisle (as the customer is walking to the cashier) they would set up another display of Mach 3 razors…10 packs.  They&#8217;ll do this because they&#8217;ve found associations which tell them, with high confidence, that males will purchase these two items together.</p>
<p>For a data mining project this past semester, one of the deliverables we created was a web-based version of the <a href="http://apriori.sematopia.com">Apriori algorithm</a>.  The algorithm was initially developed by Agrawal et al. (Agrawal 93, Agrawal 94) and is used to find association rules in a dataset.  This online version of the Apriori algorithm is based on an implementation by <a href="http://fuzzy.cs.uni-magdeburg.de/~borgelt/apriori.html">Christian Borgelt</a>.</p>
<p>The URL is <a href="http://apriori.sematopia.com">http://apriori.sematopia.com</a>.  Aside from simply executing the Apriori implementation by Christian, the web-based application performs a series of pre-processing steps on the data.  To begin, all missing values in a given numeric column are filled in with the average of that attribute.  Then categorization occurs on all numeric attributes to place each column value in a bin.  This is based on the Category Granularity (how many categories to make in a given column) and the Categorization Threshold (if the number of unique values in the column is less than this number, then categorization will not be performed).  There are many other options (besides Category Granularity &#038; Categorization Threshold) that can be set to adjust the results Apriori returns; these include the max/min support, the min confidence, min sets per rule, etc.</p>
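<p>To make the pre-processing concrete, here is a small Python sketch of the two steps. It is an illustration, not the site's actual code: the function names are made up, missing values are represented as <code>None</code>, and the bins are assumed to be equal-width, which the description above does not specify.</p>

```python
def fill_missing(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def categorize(column, granularity, threshold):
    """Map numeric values to bin numbers 1..granularity.

    granularity: how many categories to make (Category Granularity).
    threshold:   skip categorization when the column has fewer unique
                 values than this (Categorization Threshold).
    """
    if threshold > len(set(column)):
        return list(column)            # too few unique values: leave as-is
    lo, hi = min(column), max(column)
    if hi == lo:
        return [1] * len(column)       # constant column: a single bin
    width = (hi - lo) / granularity    # assumption: equal-width bins
    # The maximum value would land in bin granularity + 1, so clamp it.
    return [min(int((v - lo) / width) + 1, granularity) for v in column]
```

<p>For example, <code>categorize(fill_missing(scores), 7, 10)</code> would fill the gaps in a hypothetical <code>scores</code> column and then place each value in one of 7 bins, provided the column has at least 10 unique values.</p>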
<p>The uploaded file must be in CSV format, with the first row as headings and each subsequent row an observation.  The maximum file size that can be uploaded is 200,000 bytes.  After the algorithm finishes (which should take less than 3 seconds) the first 100 association rules will be shown.  You can download the full output from the website.</p>
<p>The associations produced will look like this:</p>
<p>verbal_sat=c3_g7(442,469) &lt;- math_sat=c2_g7(500,530) act=c4_g6(21,23)  (36.1, 93.0)</p>
<p>c3 means column 3<br />
g7(442,469) means group (bin) 7, where this specific bin covers values between 442 and 469<br />
(36.1, 93.0) at the end of each rule gives the support and confidence</p>
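<p>As a rough guide to reading those two numbers, the Python sketch below computes support and confidence for a single rule. It assumes the common textbook definitions (support is the percentage of rows containing every item in the rule, confidence is the percentage of rows matching the right-hand side that also contain the left-hand side); the underlying Apriori implementation may define support differently depending on its options. The function name <code>rule_stats</code> and the row representation (a set of item strings per row) are made up for the example.</p>

```python
def rule_stats(rows, body, head):
    """Return (support %, confidence %) for the rule 'body implies head'.

    rows: list of sets, one set of items per observation.
    body: items on the right of the arrow in the output above.
    head: item(s) on the left of the arrow.
    """
    body, head = set(body), set(head)
    with_body = [r for r in rows if body.issubset(r)]
    with_both = [r for r in with_body if head.issubset(r)]
    # Assumption: support counts rows that contain body and head together.
    support = 100.0 * len(with_both) / len(rows)
    confidence = 100.0 * len(with_both) / len(with_body) if with_body else 0.0
    return support, confidence
```

<p>Under these definitions, a confidence of 93.0 would mean that 93% of the rows matching the math_sat and act bins also fall into the stated verbal_sat bin.</p>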
<p>Each IP address will be allowed to run 5 datasets per day.<br />
Good luck, happy mining.<br />
<a href="http://apriori.sematopia.com">http://apriori.sematopia.com</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.sematopia.com/2006/04/apriori-online/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>
