Monday, January 6, 2014

Clustering algorithms types: partitional clustering and hierarchical clustering

1. Flat or Partitional clustering:
(K-means, Gaussian mixture models, etc.)
Partitions are independent of each other

2. Hierarchical clustering:
(e.g., agglomerative clustering, divisive clustering)
- Partitions can be visualized using a tree structure (a dendr
ogram)
- Does not need the number of clusters as input
- Possible to view partitions at different levels of granularities
(i.e., can refine/coarsen clusters) using different K


 K-means variants:
-Hartigan’s k-means algorithm
-Lloyd’s k-means algorithm
-Forgy’s k-means algorithm
-McQueen’s k-means algorithm


a good article about cluster analysis in R.
============================================
Read up on Gower's Distance measures (available in the ecodist
package) which can combine numeric and categorical data
=======
What do you mean by representing the categorical fields by 1:k?

becomes


That guarantees your results are worthless unless your categories
have an inherent order (e.g. tiny, small, medium, big, giant).
Otherwise it should be four (k-1) indicator/dummy variables (e.g.):


Then you can use Euclidean distance.

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352
===============
Do also note that a generalized Gower's distance (+ weighting of
variables) is available from the ('recommended' hence always
installed) package 'cluster' :

  require("cluster")
  ?daisy
  ## notably  daisy(*,  metric="gower")

Note that daisy() is more sophisticated than most users know, using the 'type = *' specification allowing, notably for binary variables (as your a. dummies above) allowing asymmetric behavior which maybe quite important in "rare event" and similar cases.

Martin
===============================
The first step is calculating a distance matrix. For a data set with n observations, the distance matrix will have n rows and n columns; the (i,j)th element of the distance matrix will be the difference between observation i and observation j. There are two functions that can be used to calculate distance matrices in R; the dist function, which is included in every version of R, and the daisy function, which is part of the cluster library.
==================================
The daisy function in the cluster library will automatically perform standardization, but it doesn't give you complete control. If you have a particular method of standardization in mind, you can use the scale function.
source: http://www.stat.berkeley.edu/classes/s133/Cluster2a.html
===================================

Saturday, January 4, 2014

Redshift convert integer to Timestamp type and convert epoch time to timestamp on the fly

Currently, Redshift does not support to_timestamp function, which is convenient for converting String or Integer into Timestamp format. Below is a quick notes on how to convert String or integer to be timestamp in Redshift.
Simple, using substring function to get parts for Year, Month, Date, Hour, Minute, Second from String or Integer and concatenate them into Timestamp format and cast to timestamp. Below are examples,
Suppose Table t1 has a column (col1) have integer value in pattern YYMMDDhhmm. For example, it value is 0403051030. It means Year 2014, March fifth, and thirty minutes past ten o'clock. We can use below two methods to convert it to be Timestamp value.
SELECT 
cast(('20' || substring(col1,1, 2) || substring(col1,3, 2) || 
      substring(col1,5, 2) || ' ' || substring(col1,7, 2) || ':' 
      || substring(col1,9, 2)) as timestamp) 
FROM t1;
Or

SELECT 
cast(to_date(col1, 'YYMMDD') || ' ' 
    || substring(col1,7,2) || ':' 
|| substring(col1,9,2) as timestamp) 
FROM t1;
Also, get idea from a blog about how to convert epoch time to timestamp on fly in Redshift.

SELECT  
    (TIMESTAMP 'epoch' + myunixtime * INTERVAL '1 Second ') 
AS   mytimestamp 
FROM  
    example_table