Clustering in R and Python

29 Feb 2016

Clustering is an unsupervised machine learning method that needs no predefined labels for the entities. Clustering algorithms assign the observations to groups based on the relevant features, so that the variation within each group is minimized.
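As a rough illustration of minimizing within-group variation, here is a minimal k-means sketch in plain Python. K-means is only one clustering algorithm, chosen here as an assumption; the toy points, the function names, and the simplification of starting from the first k points are all hypothetical and not taken from the post.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100):
    """Assign each point to one of k groups (simplified k-means)."""
    # Simplification: start from the first k points instead of a random draw.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = mean_point(members)
    return labels, centroids


# Hypothetical toy data: two well-separated groups, no labels supplied.
toy_points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0),
              (5.0, 5.2), (5.1, 4.9), (4.9, 5.0)]
labels, centroids = k_means(toy_points, k=2)
```

The algorithm never sees a label; the groups emerge only from the distances between feature vectors.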

Classification in R and Python

18 Feb 2016

The iris data set, which contains 150 samples with 4 morphological variables from three species of iris, is widely used by data scientists to demonstrate machine learning algorithms. I use the k-nearest neighbors method^[1] to classify the species based on their sepal length and width and petal length and width.
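The idea behind k-nearest neighbors can be sketched in a few lines of plain Python. This is an illustration only, not the code from the post: the function name `knn_predict` is hypothetical, and the nine training rows are hand-typed illustrative measurements in iris-like units (sepal length, sepal width, petal length, petal width), not the full data set.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest training rows."""
    # Sort training rows by Euclidean distance to the query point.
    neighbors = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Illustrative rows only (assumed values), three per species.
train_features = [
    (5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2), (4.7, 3.2, 1.3, 0.2),
    (7.0, 3.2, 4.7, 1.4), (6.4, 3.2, 4.5, 1.5), (6.9, 3.1, 4.9, 1.5),
    (6.3, 3.3, 6.0, 2.5), (5.8, 2.7, 5.1, 1.9), (7.1, 3.0, 5.9, 2.1),
]
train_labels = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3

small_flower = knn_predict(train_features, train_labels, (5.0, 3.4, 1.5, 0.2))
large_flower = knn_predict(train_features, train_labels, (6.5, 3.0, 5.8, 2.2))
```

In practice the features are usually scaled first and k is tuned on held-out data; both steps are omitted here for brevity.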

Interactive global map with D3.js

17 Feb 2016

Global map

Inspired by these posts (this, this and this), I created a spherical map as follows:

Simple interactive stacked bar graph with D3.js

16 Feb 2016

A picture is worth a thousand words, and it is more fun and more effective if we can interact with the chart.

Modern hardware and web technologies make it possible for the average person to build interactive data visualizations on the Internet. Among these web technologies, JavaScript, with its many extension libraries, is the most powerful one. Here is an example that uses D3.js to make an interactive stacked bar chart.

Parallel computing in R

03 Feb 2016

R is single-threaded by design, so it can be slow on large data sets or computation-heavy tasks. Third-party packages let us parallelize a task across multiple cores or machines. The usual workflow is: first, define the number of cores and initialize the workers that will receive the computations; then split the task into pieces and distribute the sub-tasks to the workers; finally, shut down the cluster and analyze the combined results. Here I use the functions in the snowfall package, which depends on snow, to give an example.
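The same split, distribute, and shut-down workflow can be sketched with Python's standard `multiprocessing` module. This is a stand-in for illustration, not the snowfall code from the post; the helper names and the sum-of-squares task are hypothetical.

```python
from multiprocessing import Pool


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous pieces of near-equal size."""
    items = list(items)
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The sub-task that each worker runs on its own piece."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, n_workers=2):
    """Define workers, distribute the pieces, then combine the results."""
    chunks = split_into_chunks(items, n_workers)
    # Leaving the with-block shuts the worker processes down.
    with Pool(processes=n_workers) as pool:
        partial_sums = pool.map(sum_of_squares, chunks)
    return sum(partial_sums)
```

When run as a script, call `parallel_sum_of_squares(range(1000), n_workers=4)` from under an `if __name__ == "__main__":` guard, which `multiprocessing` needs on platforms that start workers by spawning a fresh interpreter.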

Basic spatial analysis with R

27 Jan 2016

Spatial analysis

In my field of study, many datasets contain location information, and this information affects the results of the analyses. Moreover, spatial data can often reveal new knowledge. Location-based data call for specific analytical techniques. Here I list some very basic functions in R for spatial analysis. For further reading, I suggest two books: Applied Spatial Data Analysis with R and An Introduction to R for Spatial Analysis and Mapping.

Multimodel inference and model averaging

25 Jan 2016

Model selection

When we have multiple competing models (corresponding to different hypotheses), we need techniques to compare their performance and then select the best model(s) to support a specific hypothesis.
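One common way to compare competing models is an information criterion. The sketch below assumes AIC and Akaike weights, which the excerpt does not name explicitly; the model names, log-likelihoods, and parameter counts are hypothetical placeholders, not results from the post.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def akaike_weights(aic_values):
    """Relative support for each model, normalized to sum to 1."""
    best = min(aic_values)
    # Relative likelihood of each model given its AIC difference from the best.
    relative = [math.exp(-0.5 * (value - best)) for value in aic_values]
    total = sum(relative)
    return [r / total for r in relative]


# Hypothetical fits: model name -> (log-likelihood, number of parameters).
fits = {
    "null": (-120.0, 1),
    "one_predictor": (-112.0, 2),
    "two_predictors": (-111.5, 3),
}
names = list(fits)
aic_values = [aic(loglik, k) for loglik, k in fits.values()]
weights = akaike_weights(aic_values)
best_model = names[aic_values.index(min(aic_values))]
```

A weight close to 1 points to a single clearly preferred model, while several similar weights suggest averaging across models rather than picking one.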

(Phylogenetic) Generalized Linear (Mixed) Model

22 Jan 2016

General linear model

Linear regression is an important foundation of many statistical techniques, but I am not able to cover the mathematical formulas or theory here. I only want to emphasize its key assumptions: linearity, homoscedasticity, independence, and normality^{1, 2, 3}.
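For reference, the four assumptions can be read off the standard textbook form of the simple linear model. The symbols below are the conventional ones and are assumed here, not taken from the post.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad
\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2),
\qquad i = 1, \dots, n
```

Linearity means the expected response is linear in the coefficients; homoscedasticity means the error variance \sigma^2 is the same for every observation; independence means the errors are mutually independent; and normality means the errors follow a Gaussian distribution.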

Basic manipulations on phylogenetic tree with R

20 Jan 2016

Phylogenetic relationship

As the last post said, species traits are more similar between closely related pairs and less similar between distantly related pairs. We need a mathematical definition of these relationships for further calculations. A phylogenetic tree quantifies the relationships between species at a given scale. In other words, the distance matrix of the species’ relationships is expressed in the form of a tree. Many methods and tools can construct a tree under different assumptions and with different algorithms, based on DNA or amino acid sequences or on other distinguishable traits. Even better, many supertrees have already been built at the species level.

There are some tutorials on basic phylogenetic analyses in R (this, this, this and this). Here I give my own example.
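To make the link between a tree and a distance matrix concrete, here is a small Python sketch that computes pairwise (patristic) distances by summing branch lengths along the path between two tips. The three-tip tree, its branch lengths, and all function names are hypothetical.

```python
# Hypothetical rooted tree: child -> (parent, branch length to parent).
TREE = {
    "A": ("AB", 1.0),
    "B": ("AB", 1.0),
    "AB": ("root", 2.0),
    "C": ("root", 3.0),
}


def path_to_root(tree, node):
    """Map the node and each of its ancestors to the distance from the node."""
    distances = {node: 0.0}
    total = 0.0
    while node in tree:
        parent, length = tree[node]
        total += length
        distances[parent] = total
        node = parent
    return distances


def patristic_distance(tree, a, b):
    """Sum of branch lengths on the path between tips a and b."""
    up_a = path_to_root(tree, a)
    up_b = path_to_root(tree, b)
    # The most recent common ancestor gives the shortest combined path.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def distance_matrix(tree, tips):
    """Nested dict of pairwise patristic distances between tips."""
    return {a: {b: patristic_distance(tree, a, b) for b in tips} for a in tips}


matrix = distance_matrix(TREE, ["A", "B", "C"])
```

Sister tips A and B end up closer to each other than either is to C, which is the sense in which the tree encodes relatedness.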

Web Scraping with R

16 Jan 2016

Web scraping

A great deal of data is on the web nowadays, and this will continue. Web scraping is a technique that lets us obtain data from websites repeatedly and automatically. Moreover, it can transform “unstructured” data (in fact, much of the content of a web page is structured, which is why you can scrape it for subsequent analysis) into “structured”, cleaner data; see Wikipedia for more details. Here are some examples of scraping data with R (this, this and this).
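The step from page markup to structured records can be sketched with Python's standard `html.parser` module. This is an illustration rather than the R code from the post: the `TableScraper` class, the sample HTML, and its values are hypothetical, and the page is an inline string so that no network access is needed.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Turn the first row into field names and the rest into records."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical page fragment with made-up values.
SAMPLE_HTML = """
<table>
  <tr><th>species</th><th>mass_g</th></tr>
  <tr><td>Species A</td><td>18</td></tr>
  <tr><td>Species B</td><td>22</td></tr>
</table>
"""
records = scrape_table(SAMPLE_HTML)
```

For real pages, the HTML would first be downloaded, and a dedicated parsing library is usually more robust than a hand-written parser like this one.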

Xianping Li A learner
