This notebook is about statistical analyses at feature level. This uses a data table after preprocessing, quality control and normalization. Annotation is not necessary at this step.
Common study designs are
Each is a separte topic, and this notebook deals with group comparisons. Please refer to this paper for extended discussions:
Gardinassi L, Xia J, Safo S, Li S. (2017) Bioinformatics tools for the interpretation of metabolomics data. Current Pharmacology Reports, December 2017, Volume 3, Issue 6, pp 374–383. DOI: 10.1007/s40495-017-0107-0.
import os
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from statsmodels.sandbox.stats.multicomp import multipletests
%matplotlib inline
INDIR = "./input_data/"
datafile = "ave_log2_modc_ae_2012.txt"
data = pd.read_table(os.path.join(INDIR + datafile))
data.head()
# get group definition
# This can be from user supplied file.
# manually define here for this example dataset
header = [x for x in data]
baseline = [x for x in header[2:] if '0hr' in x]
mock = [x for x in header[2:] if 'mock_6hr' in x]
yfv = [x for x in header[2:] if 'yf_6hr' in x]
print("groups: ", (baseline, mock, yfv))
Student's t-test is a parametric method commonly used to compare two groups. But the method assume normal distribution of data. If data contain a lot of missing values, nonparametric methods are better suited, e.g. Mann-Whitney U test (stats.mannwhitneyu). A paired test should be used on repeated measurements (e.g. stats.ttest_rel).
number_features = data.shape[0]
# this line select columns by sample names
data_baseline, data_mock, data_yfv = data[baseline].values, data[mock].values, data[yfv].values
# compare yfv and baseline
t_scores, p_values, fdrs = [], [], []
for ii in range(number_features):
# this is t-test on independent samples
t,p = stats.ttest_ind(data_yfv[ii,:], data_baseline[ii,:])
# nan may be returned
if np.isnan(t):
t,p = 0,1
t_scores.append(t)
p_values.append(p)
# Compute FDR using Benjamini-Hochberg procedure
fdrs = multipletests(p_values, method = 'fdr_bh')[1]
#sort output
new = []
for ii in range(number_features):
mz, rtime = data.values[ii, :2]
new.append([p_values[ii]] + [str(x) for x in [
mz, rtime, p_values[ii], t_scores[ii], fdrs[ii], 'row_'+str(ii+1)]])
new.sort()
# output format
# mz, rtime, p, t, row_num,
s = '\t'.join(['m/z', 'retention_time', 'p-value', 't-score', 'FDR_BH'
'row_number']) + '\n'
for L in new:
s += '\t'.join(L[1:]) + '\n'
with open("ttest_yfv_baseline_" + datafile, "w") as file:
file.write(s)
# compare 3 groups
t_scores, p_values, fdrs = [], [], []
for ii in range(number_features):
# t is actually f-score here
t,p = stats.f_oneway(data_yfv[ii,:], data_mock[ii,:], data_baseline[ii,:])
# nan may be returned
if np.isnan(t):
t,p = 0,1
t_scores.append(t)
p_values.append(p)
# Compute FDR using Benjamini-Hochberg procedure
fdrs = multipletests(p_values, method = 'fdr_bh')[1]
#sort output
new = []
for ii in range(number_features):
mz, rtime = data.values[ii, :2]
new.append([p_values[ii]] + [str(x) for x in [
mz, rtime, p_values[ii], t_scores[ii], fdrs[ii], 'row_'+str(ii+1)]])
new.sort()
# output format
# mz, rtime, p, t, row_num,
s = '\t'.join(['m/z', 'retention_time', 'p-value', 'f-score', 'FDR_BH'
'row_number']) + '\n'
for L in new:
s += '\t'.join(L[1:]) + '\n'
with open("ANOVA_" + datafile, "w") as file:
file.write(s)