Statistical analyses of metabolite features (group comparison)¶

This notebook is about statistical analyses at feature level. This uses a data table after preprocessing, quality control and normalization. Annotation is not necessary at this step.

Common study designs are

group comparison
linear regression to test association with a variable
time series

Each is a separte topic, and this notebook deals with group comparisons. Please refer to this paper for extended discussions:

Gardinassi L, Xia J, Safo S, Li S. (2017) Bioinformatics tools for the interpretation of metabolomics data. Current Pharmacology Reports, December 2017, Volume 3, Issue 6, pp 374–383. DOI: 10.1007/s40495-017-0107-0.

import os
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from statsmodels.sandbox.stats.multicomp import multipletests

%matplotlib inline

INDIR = "./input_data/"
datafile = "ave_log2_modc_ae_2012.txt"
data = pd.read_table(os.path.join(INDIR + datafile))
data.head()

# get group definition
# This can be from user supplied file.

# manually define here for this example dataset
header = [x for x in data]
baseline = [x for x in header[2:] if '0hr' in x]
mock = [x for x in header[2:] if 'mock_6hr' in x]
yfv = [x for x in header[2:] if 'yf_6hr' in x]

print("groups: ", (baseline, mock, yfv))

('groups: ', (['p_0hr_01', 'p_0hr_02', 'p_0hr_03'], ['mock_6hr_01', 'mock_6hr_02', 'mock_6hr_03'], ['yf_6hr_01', 'yf_6hr_02', 'yf_6hr_03']))

Compare two groups¶

Student's t-test is a parametric method commonly used to compare two groups. But the method assume normal distribution of data. If data contain a lot of missing values, nonparametric methods are better suited, e.g. Mann-Whitney U test (stats.mannwhitneyu). A paired test should be used on repeated measurements (e.g. stats.ttest_rel).

number_features = data.shape[0]
# this line select columns by sample names
data_baseline, data_mock, data_yfv = data[baseline].values, data[mock].values, data[yfv].values

# compare yfv and baseline
t_scores, p_values, fdrs = [], [], []
for ii in range(number_features):
    # this is t-test on independent samples
    t,p = stats.ttest_ind(data_yfv[ii,:], data_baseline[ii,:])
    # nan may be returned
    if np.isnan(t):
        t,p = 0,1
    t_scores.append(t)
    p_values.append(p)
        
# Compute FDR using Benjamini-Hochberg procedure
fdrs = multipletests(p_values, method = 'fdr_bh')[1]

#sort output
new = []
for ii in range(number_features):
    mz, rtime = data.values[ii, :2]
    new.append([p_values[ii]] + [str(x) for x in [
                mz, rtime, p_values[ii], t_scores[ii], fdrs[ii], 'row_'+str(ii+1)]])
new.sort()

# output format 
# mz, rtime, p, t, row_num, 
s = '\t'.join(['m/z', 'retention_time', 'p-value', 't-score', 'FDR_BH'
                   'row_number']) + '\n'
for L in new:
    s += '\t'.join(L[1:]) + '\n'
    
with open("ttest_yfv_baseline_" + datafile, "w") as file:
    file.write(s)

Multiple group comparison using ANOVA¶

# compare 3 groups
t_scores, p_values, fdrs = [], [], []
for ii in range(number_features):
    # t is actually f-score here
    t,p = stats.f_oneway(data_yfv[ii,:], data_mock[ii,:], data_baseline[ii,:])
    # nan may be returned
    if np.isnan(t):
        t,p = 0,1
    t_scores.append(t)
    p_values.append(p)
        
# Compute FDR using Benjamini-Hochberg procedure
fdrs = multipletests(p_values, method = 'fdr_bh')[1]

#sort output
new = []
for ii in range(number_features):
    mz, rtime = data.values[ii, :2]
    new.append([p_values[ii]] + [str(x) for x in [
                mz, rtime, p_values[ii], t_scores[ii], fdrs[ii], 'row_'+str(ii+1)]])
new.sort()

# output format 
# mz, rtime, p, t, row_num, 
s = '\t'.join(['m/z', 'retention_time', 'p-value', 'f-score', 'FDR_BH'
                   'row_number']) + '\n'
for L in new:
    s += '\t'.join(L[1:]) + '\n'
    
with open("ANOVA_" + datafile, "w") as file:
    file.write(s)

	mz	retention_time	mock_6hr_01	mock_6hr_02	mock_6hr_03	p_0hr_01	p_0hr_02	p_0hr_03	yf_6hr_01	yf_6hr_02	yf_6hr_03
0	85.02783	59	17.23	17.18	17.44	15.67	15.57	17.26	16.94	16.98	16.56
1	85.04717	124	16.53	15.80	16.62	14.37	14.85	15.63	16.71	15.65	16.27
2	85.06532	68	10.80	10.61	11.36	14.87	14.89	12.53	14.42	14.03	10.81
3	85.10073	16	13.32	13.16	13.42	13.77	12.42	13.31	12.99	12.99	12.95
4	86.05951	67	18.42	18.15	18.37	15.18	17.02	17.91	17.76	17.67	15.15