29 Feb 2016
Clustering is an unsupervised machine learning method that needs no predefined labels to describe the entities. Clustering algorithms assign the observations to several groups, operating on the relevant features to minimize the variation within groups.
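To make "minimize the variation within groups" concrete, here is a minimal sketch in Python rather than R. It assumes k-means as the algorithm, which this teaser does not name, and the points are made up for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters by alternating assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct observations
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical observations: two visibly separate groups, no labels given.
points = [(1.0, 1.1), (0.9, 0.8), (1.2, 1.0), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centroids = kmeans(points, k=2)
```

The first three points should end up sharing one label and the last three the other; which group gets label 0 depends on the random start.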
Read more
18 Feb 2016
The iris data set, which contains 150 samples with four morphological variables from three species of iris, is widely used by data scientists to demonstrate machine learning algorithms. I use the k-nearest neighbors method[1] to classify the species based on their sepal length and width and petal length and width.
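The idea behind k-nearest neighbors can be sketched in a few lines. This is Python rather than R, and the measurements and the labels `species_a` and `species_b` are hypothetical stand-ins, not rows from the real iris data.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Return the majority label among the k training samples nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical samples: (sepal length, sepal width, petal length, petal width).
train = [
    ((5.0, 3.4, 1.5, 0.2), "species_a"),
    ((4.8, 3.1, 1.4, 0.2), "species_a"),
    ((5.2, 3.6, 1.6, 0.3), "species_a"),
    ((6.4, 2.9, 4.5, 1.4), "species_b"),
    ((6.0, 2.8, 4.3, 1.3), "species_b"),
    ((6.6, 3.0, 4.7, 1.5), "species_b"),
]

first = knn_predict(train, (5.1, 3.5, 1.5, 0.2))
second = knn_predict(train, (6.2, 2.9, 4.4, 1.4))
```

Each query takes the label held by most of its three closest training samples, measured by Euclidean distance over the four variables.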
Read more
17 Feb 2016
Global map
Inspired by these posts (this, this and this), I create a spherical map as follows:
Read more
16 Feb 2016
A picture is worth a thousand words, and a chart is more fun and more effective if we can interact with it.
Modern hardware and web technologies make it possible for the average person to build interactive data visualizations on the Internet. Among these web technologies, JavaScript, equipped with its many extension libraries, is the most powerful one. Here is an example using D3.js to make an interactive stacked bar chart.
Read more
03 Feb 2016
R is single-threaded by design, so it can be slow on big data sets or heavy, time-consuming computations. We can use third-party packages to parallelize a task across multiple cores or machines. Normally, we first define the number of cores and initialize the workers that computations will be sent to. Then we split the task into pieces and distribute the sub-tasks to the workers. Finally, we shut down the cluster and do further analysis on the results. Here I give an example using the functions in the package snowfall, which depends on snow.
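The workflow described above (initialize workers, split, distribute, shut down, analyze) can be sketched as follows. This is Python rather than R and is not the snowfall API. The helper names are hypothetical, and a thread pool stands in for snowfall's worker processes, so it shows the steps rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(items, n_chunks):
    """Split a list into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]

def sum_of_squares(piece):
    """The sub-task each worker runs on its piece."""
    return sum(x * x for x in piece)

n_workers = 4                                   # 1. define the number of workers
values = list(range(1, 101))
pieces = chunk(values, n_workers)               # 2. split the task into pieces

pool = ThreadPoolExecutor(max_workers=n_workers)  # 3. initialize the workers
try:
    partials = list(pool.map(sum_of_squares, pieces))  # 4. distribute the sub-tasks
finally:
    pool.shutdown()                             # 5. shut down the pool

total = sum(partials)                           # 6. further analysis on the results
```

For CPU-bound work in Python, a process pool would be closer in spirit to snowfall's separate workers; threads are used here only to keep the sketch simple.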
Read more
27 Jan 2016
Spatial analysis
In my field of study, many datasets contain location information, and this information affects the results of the analyses. Moreover, we can often gain new knowledge from these spatial data. Location-based data call for specific analytical techniques. Here I list some very basic functions in R for spatial analysis. For further reading, I suggest two books: Applied Spatial Data Analysis with R and An Introduction to R for Spatial Analysis and Mapping.
Read more
25 Jan 2016
Model selection
When we have multiple competing models (corresponding to various hypotheses), we need techniques to compare their performance and then select the best model(s) to support a specific hypothesis.
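One common way to compare competing models is an information criterion. The sketch below is in Python rather than R and assumes AIC (AIC = 2k - 2 ln L), which this teaser does not name. The model names and log-likelihood values are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def rank_models(models):
    """Rank models by AIC.

    models maps a name to (log_likelihood, n_params); the result is a list of
    (name, aic, delta) tuples, where delta is the gap to the best model.
    """
    scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
    best = min(scores.values())
    return sorted(
        ((name, score, score - best) for name, score in scores.items()),
        key=lambda row: row[1],
    )

# Hypothetical fits: each extra parameter must buy enough likelihood to pay off.
candidates = {
    "null": (-120.0, 1),
    "one_predictor": (-100.0, 2),
    "two_predictors": (-99.5, 3),
}
ranking = rank_models(candidates)
```

Here the second predictor raises the log-likelihood by only 0.5, which does not offset the penalty for an extra parameter, so the one-predictor model ranks first.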
Read more
22 Jan 2016
General linear model
Linear regression is an important foundation of many statistical techniques, though I cannot go into the mathematical formulas or theory here. I only want to emphasize some of its key assumptions: linearity, homoscedasticity, independence, and normality[1, 2, 3].
Read more
20 Jan 2016
Phylogenetic relationship
As the last post says, species traits are more similar between closely related pairs and less similar between distant pairs. We need a mathematical definition of this relationship for further calculations. A phylogenetic tree can quantify the relationships between species at a certain scale. In other words, the distance matrix of the species’ relationships is represented in the form of a tree. There are many methods and tools for constructing a tree under different assumptions and with different algorithms, based on DNA or amino acid sequences or other distinguishable traits. Even better, many supertrees have already been built at the species level.
There are some tutorials on basic phylogenetic analyses in R (this, this, this and this). I will give my own example.
Read more
16 Jan 2016
Web scraping
A great deal of data is on the web nowadays, and that will continue. A technique called web scraping lets us obtain data from websites repeatedly and automatically. Moreover, this procedure can transform “unstructured” data (in fact much of the content of a web page is structured, which is why you can scrape it for subsequent analysis) into “structured” data (cleaner data); see Wikipedia for more details. There are some examples of scraping data with R (this, this and this).
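The step from page markup to "structured" data can be sketched as below. This is Python rather than R, the `TableParser` class is a hypothetical helper, and the inline HTML string with made-up values stands in for a page that would normally be downloaded.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row being read, or None outside a row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None

# Hypothetical page content; a real scraper would fetch this from a website.
page = """<table>
<tr><th>species</th><th>count</th></tr>
<tr><td>oak</td><td>12</td></tr>
<tr><td>pine</td><td>7</td></tr>
</table>"""

parser = TableParser()
parser.feed(page)
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
```

The result is a list of dictionaries keyed by the header row, a form that is ready for cleaning and analysis. The values are still strings at this point.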
Read more