Building a metabolic model for mummichog 2

A metabolic model describes the relationship between metabolites (compounds) through metabolic reactions.

A reaction is the conversion from reactants to products. These reactions may be catalyzed by enzymes, and they may belong to predefined metabolic pathways. Although broadly speaking, there's a class of transport reactions, which move metabolites across cellular compartments by the help of transporter proteins.

Because the products of one reaction are often the reactants of another, most reactions are connected into a large metabolic network. Reactants and products can also be shared or competed between reactions. The metabolic network is often described via a bipartite graph, with metabolites and enzymes as the two node types.

These data for a metabolic model can be from a database, e.g. BioCyc or other published models, e.g. RECON2. This notebook shows how the model from mummichog 1 is converted to JSON format for mummichog 2. The design change in version 2 is to separate the model more cleanly from the algorithm, easier to do web integration and model update.

By Shuzhao Li, 2017-07-17

In [2]:
import json
# this human_model_mfn model is from mummichog version 1
# Currently, mummichog is a statistical approach that does not take into account of reaction rates.
import human_model_mfn as model
dir(model)
Out[2]:
['__builtins__',
 '__doc__',
 '__file__',
 '__name__',
 '__package__',
 'cpd2pathways',
 'cpd_edges',
 'dict_cpds_def',
 'dict_cpds_mass',
 'edge2enzyme',
 'edge2rxn',
 'metabolic_pathways',
 'metabolic_rxns',
 'pathdotdict',
 'version']
In [3]:
for x in [
    model.dict_cpds_def,
    model.cpd_edges,
    model.dict_cpds_mass,
    model.metabolic_rxns,
    model.metabolic_pathways,
]:
    print (len(x))
    
# also see what's in cpd def
print(model.dict_cpds_def.items()[:3])

# will need fix a bug when cpd name doesn't show...
4650
18504
13186
4204
119
[('C00259', 'L-Arabinose; L-Arabinopyranose'), ('CE1714', 'dehydrodolichol'), ('crtn', 'Creatinine')]

Getting chemical formula and calculate adducts

No formula was in the MFN model in mummichog 1. Let's get from RECON2 and HMDB. RECON2 is a compartmentalized model, where ''' c = cytoplasm e = extracellular space g = Golgi apparatus l = lysosome m = mitochondrion n = nucleus r = endoplasmic reticulum x = peroxisome '''

In [4]:
# get relevant compounds in model
cpdsInModel = []
for x in model.cpd_edges: 
    cpdsInModel += x

cpds2include = model.cpd2pathways.keys() + cpdsInModel
cpds2include = set(cpds2include)
print( len(cpds2include) )
3560
In [5]:
# These are previously parsed data.
hmdb = 'parsed_HMDBmetabolites3.5.txt'
recon2 = 'parsed_species_recon2.txt'

hmdb = open(hmdb).read().splitlines()
recon2 = open(recon2).read().splitlines()

h = hmdb[0].split('\t')
print(h)
print("chemical_formula", h.index("chemical_formula"))
print("kegg_id", h.index("kegg_id"))

print(recon2[0].split('\t'))
['accession', 'name', 'chemical_formula', 'monisotopic_moleculate_weight', 'iupac_name', 'traditional_iupac', 'cas_registry_number', 'smiles', 'inchi', 'inchikey', 'pathways', 'normal_concentrations', 'abnormal_concentrations', 'diseases', 'drugbank_id', 'drugbank_metabolite_id', 'phenol_explorer_compound_id', 'phenol_explorer_metabolite_id', 'foodb_id', 'knapsack_id', 'chemspider_id', 'kegg_id', 'biocyc_id', 'bigg_id', 'wikipidia', 'nugowiki', 'metagene', 'metlin_id', 'pubchem_compound_id', 'het_id', 'chebi_id', 'protein_associations']
('chemical_formula', 2)
('kegg_id', 21)
['#id', 'name', 'compartment', 'FORMULA', 'bodytext', 'annotation', 'kegg', 'hmdb']
In [6]:
# build dictionary, {KEGG-like ID: chemical_formula}
dict_formula = {}
for line in hmdb:
    a = line.split('\t')
    dict_formula[a[21]] = a[2]
    
# RECON2 as id format like `M_34dhoxmand_c` 
for line in recon2:
    a = line.split('\t')
    if a[0][0] == "M":
        dict_formula[a[0].split('_')[1]] = a[3]

print(len(dict_formula))
print("Overlap number of cpds = ", len(cpds2include.intersection( set(dict_formula.keys()) )))
6725
('Overlap number of cpds = ', 2188)
In [7]:
print(len( set(model.dict_cpds_mass.keys()).intersection(cpds2include) ))
print(len( set(model.dict_cpds_mass.keys()).intersection(set(dict_formula.keys())) ))
print(len( set(model.dict_cpds_mass.keys()).intersection(set(dict_formula.keys())).intersection(cpds2include) ))
2019
4500
1700
In [8]:
# How many cpds don't have mass but have formula?
len([x for x in cpds2include if x not in model.dict_cpds_mass.keys() and x in dict_formula.keys()])
Out[8]:
488

So need to compute mass if not in dict_cpds_mass

The dict_cpds_mass may originally come from KEGG, not sure if mass deficiency from chemical bonding was considered.

In [9]:
# new mass calculation
# using pychemy module. It failed to install using pip because of `Open Babel` binding.
# But the mass calcuation function is not depdendent on `Open Babel`, 
# thus pychemy folder is copied here and it's usable as below

from pychemy.molmass import Formula

massdict_cpds2include = {}
unassigned = []
for x in cpds2include:
    if model.dict_cpds_mass.has_key(x):
        massdict_cpds2include[x] = model.dict_cpds_mass[x]
    elif dict_formula.has_key(x):
        try:
            massdict_cpds2include[x] = Formula( dict_formula[x] ).isotope.mass
        except:
            print(dict_formula[x])
            unassigned.append(x)
    else:
        unassigned.append(x)

print(len(massdict_cpds2include))
print("unassigned, ", len(unassigned))
C29H51N2O8PRS
C11H19O9X
C5H9O4X
C68H113N4O49X

C25H45N2O8PRS
C22H37N2O15X
C29H53N2O8PRS
C8H14NO8PR2
CO2FULLR2C21H31N7O15P3S





RH2

C9H15NO8PR2
CO2FULLR2
C13H23N2O8PRS
C11H15O16P2R2

C27H49N2O8PRS
C29H55N2O8PRS




R
C4H4O7PR
C22H37N2O15X
C24H40N3O15X
C29H51N2O8PRS
C66H111O56X
C42H71O36X
C66H111O56X
C12H21O11X
C18H31O16X
CO2FULLR2C7H14NO2
C48H81O41X
CO2FULLRC21H31N7O15P3S
C3H6O6PR
C5H8O7PR
CO2FULLR3C7H14NO2
C5H9O4R
C50H83N4O35X
CO2FULLRC7H14NO2
C29H51N2O8PRS
C90H145N6O65X
CO2FULLR


C48H81O41X
CO2FULLR3C21H31N7O15P3S
C46H77N2O35X
C31H56NO13R
C20H29OFULLR2CO
C16H27N2O10X
C84H135N6O61X
C11H14O19P3R2
2447
('unassigned, ', 1113)

The above mass calculation raised the number of compounds with mass from 2019 to 2447.

Next, compute adducts.

In [10]:
# computing adducts filtered by formula
# if no formula is available, should use default adduct calculation
# make sure mono mass is present

import re

def parse_chemformula(x):
    '''This does not deal with nested groups in chemical formula.
    Formula from HMDB 3.5 are compatible.
    '''
    p = re.findall(r'([A-Z][a-z]*)(\d*)', x)
    d = {}
    for pair in p: d[pair[0]] = int( pair[1] or 1 )
    return d

def check_sub(Fm1, Fm2):
    '''Check if Fm2 a subunit of Fm1; Dictionary representation of chem formula
    '''
    Fm2_in_Fm1 = True
    Elements1 = Fm1.keys()
    Elements2 = Fm2.keys()
    if [x for x in Elements2 if x not in Elements1]:    #element not in Fm1
        return False
    else:
        for e in Elements2:
            if Fm2[e] > Fm1[e]: Fm2_in_Fm1 = False
        return Fm2_in_Fm1

def compute_adducts(mw, cFormula):
    '''
    # new function calculating adducts
    # A few rules here:
    # C13, S34, Cl37 pos only applicable if the atom is in formula
    # and others in `required subgroup`, as the last item in the tuples
    # Better way is to based on chemical structure - future direction.
    '''
    PROTON = 1.00727646677
    addList = [(mw, 'M[1+]', ''), 
         (mw + PROTON, 'M+H[1+]', ''),
         (mw/2 + PROTON, 'M+2H[2+]', ''),
         (mw/3 + PROTON, 'M+3H[3+]', ''),
         (mw +1.0034 + PROTON, 'M(C13)+H[1+]', 'C'),
         (mw/2 + 0.5017 + PROTON, 'M(C13)+2H[2+]', 'C'),
         (mw/3 + 0.3344 + PROTON, 'M(C13)+3H[3+]', 'C'),
         (mw +1.9958 + PROTON, 'M(S34)+H[1+]', 'S'),
         (mw +1.9972 + PROTON, 'M(Cl37)+H[1+]', 'Cl'),
         (mw + 21.9820 + PROTON, 'M+Na[1+]', ''), 
         (mw/2 + 10.991 + PROTON, 'M+H+Na[2+]', ''),
         (mw + 37.9555 + PROTON, 'M+K[1+]', ''), 
         (mw + 18.0106 + PROTON, 'M+H2O+H[1+]', ''), 
         (mw - 18.0106 + PROTON, 'M-H2O+H[1+]', 'H2O'), 
         (mw - 36.0212 + PROTON, 'M-H4O2+H[1+]', 'H4O2'),
         (mw - 17.0265 + PROTON, 'M-NH3+H[1+]', 'NH3'),
         (mw - 27.9950 + PROTON, 'M-CO+H[1+]', 'CO'),
         (mw - 43.9898 + PROTON, 'M-CO2+H[1+]', 'CO2'),
         (mw - 46.0054 + PROTON, 'M-HCOOH+H[1+]', 'H2CO2'),
         (mw + 67.9874 + PROTON, 'M+HCOONa[1+]', ''),
         (mw - 67.9874 + PROTON, 'M-HCOONa+H[1+]', 'HCO2Na'),
         (mw + 57.9586 + PROTON, 'M+NaCl[1+]', ''), 
         (mw - 72.0211 + PROTON, 'M-C3H4O2+H[1+]', 'C3H4O2'),
         (mw + 83.9613 + PROTON, 'M+HCOOK[1+]', ''),
         (mw - 83.9613 + PROTON, 'M-HCOOK+H[1+]', 'HCO2K'),
         ] + [(mw - PROTON, 'M-H[-]', ''),
        (mw/2 - PROTON, 'M-2H[2-]', ''),
        (mw + 1.0034 - PROTON, 'M(C13)-H[-]', 'C'),
        (mw + 1.9958 - PROTON, 'M(S34)-H[-]', 'S'),
        (mw + 1.9972 - PROTON, 'M(Cl37)-H[-]', 'Cl'),
        (mw + 21.9820 - 2*PROTON, 'M+Na-2H[-]', ''),
        (mw + 37.9555 - 2*PROTON, 'M+K-2H[-]', ''),
        (mw - 18.0106 - PROTON, 'M-H2O-H[-]', 'H2O'),
        (mw + 34.9689, 'M+Cl[-]', ''),
        (mw + 36.9659, 'M+Cl37[-]', ''),
        (mw + 78.9183, 'M+Br[-]', ''),
        (mw + 80.9163, 'M+Br81[-]', ''),
        (mw + 2*12 + 3*1.007825 + 14.00307 - PROTON, 'M+ACN-H[-]', ''),
        (mw + 1.007825 + 12 + 2*15.99491, 'M+HCOO[-]', ''),
        (mw + 3*1.007825 + 2*12 + 2*15.99491, 'M+CH3COO[-]', ''),
        (mw - PROTON + 15.99491, 'M-H+O[-]', ''),
        ]

    dict_cFormula = parse_chemformula(cFormula)
    mydict = {}
    for x in addList:
        if check_sub(dict_cFormula, parse_chemformula(x[2])):
            mydict[x[1]] = x[0]

    return mydict

# test
print(dict_formula['C05587'])
print(model.dict_cpds_def['C05587'])
print(model.dict_cpds_mass['C05587'])

print (compute_adducts( model.dict_cpds_mass['C05587'], dict_formula['C05587'] ))
C9H13NO2
3-Methoxytyramine
167.0946
{'M+K[1+]': 206.05737646677002, 'M+Cl[-]': 202.0635, 'M+Br[-]': 246.0129, 'M+H[1+]': 168.10187646677002, 'M+HCOOK[1+]': 252.06317646677002, 'M(C13)+3H[3+]': 57.03987646677001, 'M+ACN-H[-]': 207.11386853323, 'M+Cl37[-]': 204.06050000000002, 'M-H2O-H[-]': 148.07672353323, 'M+CH3COO[-]': 226.107895, 'M[1+]': 167.0946, 'M-HCOOH+H[1+]': 122.09647646677, 'M+NaCl[1+]': 226.06047646677, 'M+H+Na[2+]': 95.54557646677, 'M+H2O+H[1+]': 186.11247646677003, 'M-H+O[-]': 182.08223353323, 'M+K-2H[-]': 203.03554706646003, 'M+2H[2+]': 84.55457646677, 'M+Br81[-]': 248.01090000000002, 'M-H2O+H[1+]': 150.09127646677, 'M-C3H4O2+H[1+]': 96.08077646677, 'M-NH3+H[1+]': 151.07537646677002, 'M+Na-2H[-]': 187.06204706646002, 'M-CO2+H[1+]': 124.11207646677, 'M+Na[1+]': 190.08387646677002, 'M-2H[2-]': 82.54002353323001, 'M-H4O2+H[1+]': 132.08067646677003, 'M(C13)-H[-]': 167.09072353323, 'M(C13)+2H[2+]': 85.05627646677, 'M+HCOO[-]': 212.09224500000002, 'M-H[-]': 166.08732353323, 'M+HCOONa[1+]': 236.08927646677, 'M+3H[3+]': 56.70547646677001, 'M(C13)+H[1+]': 169.10527646677002, 'M-CO+H[1+]': 140.10687646677002}

Next steps

  • fix empty names
  • build a new dict of compounds that carry formula and adduct dict
In [11]:
#
missing_names = [x for x in cpds2include if not model.dict_cpds_def.get(x, '')]
print (len(missing_names))
print(missing_names[:20])

# a few drugs are in missing, but can be left out, 3560
cpds2include = [x for x in cpds2include if x[1]!='D']
print(len(cpds2include))
259
['C04878', 'C01620', 'CE5776', 'C02917', 'C01029', 'C02823', 'C03265', 'C16360', 'C02362', 'CE7217', 'C14860', 'C14862', 'C14867', 'C00088', 'CE5922', 'CE5920', 'CE5921', 'CE5926', 'CE5924', 'CE5928']
3557
In [12]:
# add name dict from RECON2
# RECON2 as id format like `M_34dhoxmand_c` 
names_recon2 = {}
for line in recon2:
    a = line.split('\t')
    if a[0][0] == "M":
        names_recon2[a[0].split('_')[1]] = a[1]

still_missing = [x for x in missing_names if x not in names_recon2.keys()]
print(len(still_missing))
#for x in still_missing:
#    print x
138

Manually finding other names...

  • KEGG batch query returned a good number (http://www.genome.jp/kegg/compound/). pasted to additional_KEGG_names.txt
  • found a copy of EHMN model at github.com/agnosti5/MOST/blob/master/etc/. pasted to additional_EHMN_names.txt
In [13]:
# OK add more name from KEGG adn EHMN model
names_extra = {}
# cpd:C01620   Threonate; L-Threonate; (2R,3S)-2,3,4-Trihydroxybutanoic acid
w = open('additional_KEGG_names.txt').read()
for line in w.splitlines():
    if 'No such data' not in line:
        names_extra[line[:13].split(':')[1].strip()] = line[13:].strip()

print(names_extra.items()[-2:])
        
# <species id="CE1941" name="isoputreanine"/>
for line in open('additional_EHMN_names.txt').readlines()[1:]:
    a = line.split('" name="')
    names_extra[a[0].split('id="')[1]] = a[1].split('"')[0]

print(len(names_extra))
print(names_extra.items()[-2:])

still_missing2 = [x for x in still_missing if x not in names_extra.keys()]
print(len(still_missing2))
print(still_missing2)

# so only 15-3 names still missing. These are likely to be obsolete IDs from KEGG
[('C18038', '7alpha-Hydroxypregnenolone; 3beta,7alpha-Dihydroxy-5-pregnen-20-one'), ('C02045', 'L-Erythrulose; L-glycero-Tetrulose')]
2764
[('CE4849', '3(S)-hydroxy-tetracosa-12,15,18,21-all-cis-tetraenoyl-CoA'), ('C14040', '1-Nitronaphthalene')]
15
['C04878', 'C03265', '"D01309 cpd"', 'C03279', 'C03020', 'C04946', 'C04820', '"D04197 cpd"', 'C00592', 'C03812', 'C05884', 'C05882', 'C05409', '"D00584 cpd"', 'C01828']
In [ ]:
 
In [14]:
# this is the new compound dict

newDict = {}
for c in cpds2include:
    name = model.dict_cpds_def.get(c, '') or names_recon2.get(c, '') or names_extra.get(c, '')
    mw = massdict_cpds2include.get(c, 0)
    f = dict_formula.get(c, '')
    if mw:
        newDict[c] = {'name': name,  'mw': mw, 'formula':f, 'adducts':compute_adducts(mw, f)}
    else:
        newDict[c] = {'name': name,  'mw': mw, 'formula':f, 'adducts':{}}

print(len(newDict))

print(newDict['C04878'])
print(newDict['CE4849'])
3557
{'formula': '', 'mw': 0, 'name': '', 'adducts': {}}
{'formula': 'C45H70N7O18P3S', 'mw': 1121.3710887012, 'name': '3(S)-hydroxy-tetracosa-12,15,18,21-all-cis-tetraenoyl-CoA', 'adducts': {'M+K[1+]': 1160.33386516797, 'M+Cl[-]': 1156.3399887012001, 'M+Br[-]': 1200.2893887012, 'M+H[1+]': 1122.37836516797, 'M+HCOOK[1+]': 1206.33966516797, 'M(C13)+3H[3+]': 375.13203936717, 'M+ACN-H[-]': 1161.39035723443, 'M+Cl37[-]': 1158.3369887012, 'M-H2O-H[-]': 1102.35321223443, 'M(S34)+H[1+]': 1124.37416516797, 'M+CH3COO[-]': 1180.3843837012, 'M[1+]': 1121.3710887012, 'M-HCOOH+H[1+]': 1076.37296516797, 'M+NaCl[1+]': 1180.33696516797, 'M+H+Na[2+]': 572.68382081737, 'M+H2O+H[1+]': 1140.3889651679701, 'M-H+O[-]': 1136.35872223443, 'M+K-2H[-]': 1157.31203576766, 'M+2H[2+]': 561.69282081737, 'M+Br81[-]': 1202.2873887012001, 'M-H2O+H[1+]': 1104.36776516797, 'M-C3H4O2+H[1+]': 1050.3572651679701, 'M-NH3+H[1+]': 1105.3518651679701, 'M+Na-2H[-]': 1141.33853576766, 'M-CO2+H[1+]': 1078.38856516797, 'M+Na[1+]': 1144.36036516797, 'M-2H[2-]': 559.67826788383, 'M-H4O2+H[1+]': 1086.3571651679702, 'M(C13)-H[-]': 1121.36721223443, 'M(C13)+2H[2+]': 562.1945208173701, 'M(S34)-H[-]': 1122.35961223443, 'M+HCOO[-]': 1166.3687337012, 'M-H[-]': 1120.36381223443, 'M+HCOONa[1+]': 1190.36576516797, 'M+3H[3+]': 374.79763936717, 'M(C13)+H[1+]': 1123.3817651679701, 'M-CO+H[1+]': 1094.3833651679702}}
In [15]:
# Convert to JSON format
# In [8]: '_'.join(sorted(('C00999', 'C00231')))
# Out[8]: 'C00231_C00999'

metabolicModels = {}
metabolicModels['human_model_mfn'] = {}
metabolicModels['human_model_mfn']['version'] = 'MFN_1.10.4'
metabolicModels['human_model_mfn']['Compounds'] = newDict

metabolicModels['human_model_mfn']['cpd2pathways'] = model.cpd2pathways
metabolicModels['human_model_mfn']['cpd_edges'] = model.cpd_edges
metabolicModels['human_model_mfn']['dict_cpds_def'] = model.dict_cpds_def
metabolicModels['human_model_mfn']['dict_cpds_mass'] = model.dict_cpds_mass

metabolicModels['human_model_mfn']['metabolic_pathways'] = model.metabolic_pathways
metabolicModels['human_model_mfn']['metabolic_rxns'] = model.metabolic_rxns
#metabolicModels['human_model_mfn']['pathdotdict'] = model.pathdotdict

# edges can't be tuples in JSON format
new_e2rxn = {}
for k,v in model.edge2rxn.items():
    new_e2rxn[','.join(sorted(k))] = v

metabolicModels['human_model_mfn']['edge2rxn'] = new_e2rxn

new_e2enzyme = {}
for k,v in model.edge2enzyme.items():
    new_e2enzyme[','.join(sorted(k))] = v

metabolicModels['human_model_mfn']['edge2enzyme'] = new_e2enzyme

out = open('JSON_metabolicModels.py', 'w')
out.write( 'metabolicModels = ' + json.dumps(metabolicModels) )
out.close()
In [16]:
# test import
from JSON_metabolicModels import metabolicModels
metabolicModels['human_model_mfn'].keys()
Out[16]:
['edge2rxn',
 'version',
 'metabolic_rxns',
 'cpd_edges',
 'metabolic_pathways',
 'Compounds',
 'dict_cpds_def',
 'cpd2pathways',
 'edge2enzyme',
 'dict_cpds_mass']

Adding this new version of model to mummichog 2.0.2-beta