Statistical analyses of metabolite features (group comparison)

This notebook is about statistical analyses at feature level. This uses a data table after preprocessing, quality control and normalization. Annotation is not necessary at this step.

Common study designs are

  • group comparison
  • linear regression to test association with a variable
  • time series

Each is a separte topic, and this notebook deals with group comparisons. Please refer to this paper for extended discussions:

Gardinassi L, Xia J, Safo S, Li S. (2017) Bioinformatics tools for the interpretation of metabolomics data. Current Pharmacology Reports, December 2017, Volume 3, Issue 6, pp 374–383. DOI: 10.1007/s40495-017-0107-0.

In [1]:
import os
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from statsmodels.sandbox.stats.multicomp import multipletests

%matplotlib inline

INDIR = "./input_data/"
datafile = "ave_log2_modc_ae_2012.txt"
data = pd.read_table(os.path.join(INDIR + datafile))
data.head()
Out[1]:
mz retention_time mock_6hr_01 mock_6hr_02 mock_6hr_03 p_0hr_01 p_0hr_02 p_0hr_03 yf_6hr_01 yf_6hr_02 yf_6hr_03
0 85.02783 59 17.23 17.18 17.44 15.67 15.57 17.26 16.94 16.98 16.56
1 85.04717 124 16.53 15.80 16.62 14.37 14.85 15.63 16.71 15.65 16.27
2 85.06532 68 10.80 10.61 11.36 14.87 14.89 12.53 14.42 14.03 10.81
3 85.10073 16 13.32 13.16 13.42 13.77 12.42 13.31 12.99 12.99 12.95
4 86.05951 67 18.42 18.15 18.37 15.18 17.02 17.91 17.76 17.67 15.15
In [2]:
# get group definition
# This can be from user supplied file.

# manually define here for this example dataset
header = [x for x in data]
baseline = [x for x in header[2:] if '0hr' in x]
mock = [x for x in header[2:] if 'mock_6hr' in x]
yfv = [x for x in header[2:] if 'yf_6hr' in x]

print("groups: ", (baseline, mock, yfv))
('groups: ', (['p_0hr_01', 'p_0hr_02', 'p_0hr_03'], ['mock_6hr_01', 'mock_6hr_02', 'mock_6hr_03'], ['yf_6hr_01', 'yf_6hr_02', 'yf_6hr_03']))

Compare two groups

Student's t-test is a parametric method commonly used to compare two groups. But the method assume normal distribution of data. If data contain a lot of missing values, nonparametric methods are better suited, e.g. Mann-Whitney U test (stats.mannwhitneyu). A paired test should be used on repeated measurements (e.g. stats.ttest_rel).

In [3]:
number_features = data.shape[0]
# this line select columns by sample names
data_baseline, data_mock, data_yfv = data[baseline].values, data[mock].values, data[yfv].values

# compare yfv and baseline
t_scores, p_values, fdrs = [], [], []
for ii in range(number_features):
    # this is t-test on independent samples
    t,p = stats.ttest_ind(data_yfv[ii,:], data_baseline[ii,:])
    # nan may be returned
    if np.isnan(t):
        t,p = 0,1
    t_scores.append(t)
    p_values.append(p)
        
# Compute FDR using Benjamini-Hochberg procedure
fdrs = multipletests(p_values, method = 'fdr_bh')[1]

#sort output
new = []
for ii in range(number_features):
    mz, rtime = data.values[ii, :2]
    new.append([p_values[ii]] + [str(x) for x in [
                mz, rtime, p_values[ii], t_scores[ii], fdrs[ii], 'row_'+str(ii+1)]])
new.sort()

# output format 
# mz, rtime, p, t, row_num, 
s = '\t'.join(['m/z', 'retention_time', 'p-value', 't-score', 'FDR_BH'
                   'row_number']) + '\n'
for L in new:
    s += '\t'.join(L[1:]) + '\n'
    
with open("ttest_yfv_baseline_" + datafile, "w") as file:
    file.write(s)

Multiple group comparison using ANOVA

In [4]:
# compare 3 groups
t_scores, p_values, fdrs = [], [], []
for ii in range(number_features):
    # t is actually f-score here
    t,p = stats.f_oneway(data_yfv[ii,:], data_mock[ii,:], data_baseline[ii,:])
    # nan may be returned
    if np.isnan(t):
        t,p = 0,1
    t_scores.append(t)
    p_values.append(p)
        
# Compute FDR using Benjamini-Hochberg procedure
fdrs = multipletests(p_values, method = 'fdr_bh')[1]

#sort output
new = []
for ii in range(number_features):
    mz, rtime = data.values[ii, :2]
    new.append([p_values[ii]] + [str(x) for x in [
                mz, rtime, p_values[ii], t_scores[ii], fdrs[ii], 'row_'+str(ii+1)]])
new.sort()

# output format 
# mz, rtime, p, t, row_num, 
s = '\t'.join(['m/z', 'retention_time', 'p-value', 'f-score', 'FDR_BH'
                   'row_number']) + '\n'
for L in new:
    s += '\t'.join(L[1:]) + '\n'
    
with open("ANOVA_" + datafile, "w") as file:
    file.write(s)