Monomer Table | Biologics | SARvision

How to Prepare Monomer Tables for Biologic Research

by Mark Hansen, Ph.D.

In sequence analysis, every monomer has a structure and associated data to help elucidate Sequence Activity Relationships.

The Monomer table (also referred to as the Residue Table) contains information about each monomer used in the sequence. Monomer information and parameters give meaning to the sequence and enhance the resulting sequence analysis. These parameters can color monomers by type and property, sort monomers in a coherent way, give structural context in mouseovers, and be used in calculations.

The sequence table shown below illustrates what data in the monomer table can add to sequence analyses. Monomer font colors are red, green and black for natural, enantiomer and unnatural amino acids, respectively. The background color of each cell is taken from the BKG:Hydrophobicity-HW column, which colors the cell by the hydrophobicity of its monomer. Mousing over a monomer displays its chemical structure and formal name. Finally, sorting a sequence column sorts by the SORTORDER column in the monomer table, which groups like monomers together. Collectively, these properties significantly enhance the interpretability of sequence tables used to analyze activity.

Example of monomer data and coloring used to make the sequence table intuitive with a greater depth of information.

At a bare minimum, a monomer table should contain a structure, names, and any physicochemical and coloring parameters that may augment analysis. Additional columns should include a sorting column so that sequence columns in an alignment can be sorted, the closest natural residue to use as a substitute in alignment algorithms, a category field (e.g. hydrophobic, aromatic, charged) and a font color to help designate type (e.g. black: natural residue, red: enantiomer, green: unnatural residue, blue: N-methylated). An example monomer table is shown below. In monomer names (short, medium and synonyms), ‘|’ (pipe) denotes a chain break and must not be used as a character. Single and double quotes, brackets, ‘#’ and periods are similarly reserved and should not be used. Any character can be used in the long names, however. Note that if a residue name is used for two different structures, only the last occurrence is retained, so duplicate names should be avoided.
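
The naming rules above are easy to check before a table is loaded. The following is a minimal sketch in Python, not part of SARvision: the function names are hypothetical, "brackets" is assumed to cover (), [] and {}, names are assumed to be case-sensitive, and short names, medium names and synonyms are assumed to share a single namespace.

```python
# Characters the text lists as reserved in short names, medium names and synonyms.
# Treating "brackets" as (), [] and {} is an assumption made for this sketch.
RESERVED = set("|'\"#.()[]{}")


def reserved_chars_in(name):
    """Return the reserved characters that occur in a name, sorted."""
    return sorted(ch for ch in set(name) if ch in RESERVED)


def check_names(rows):
    """Return a list of human-readable problems for a list of monomer rows.

    Each row is a dict keyed by column name. LONG_NAME is not checked,
    because any character is allowed there.
    """
    problems = []
    owner = {}  # name -> index of the most recent row that used it
    for i, row in enumerate(rows):
        names = [row.get("SHORT_NAME", ""), row.get("MEDIUM_NAME", "")]
        names += row.get("SYNONYMS", "").split(";")
        names = [n.strip() for n in names]
        # dict.fromkeys drops repeats within one row but keeps the order.
        for name in dict.fromkeys(n for n in names if n):
            bad = reserved_chars_in(name)
            if bad:
                problems.append(
                    f"row {i}: {name!r} contains reserved character(s) {' '.join(bad)}"
                )
            if name in owner:
                problems.append(
                    f"row {i}: {name!r} was already used by row {owner[name]}"
                )
            owner[name] = i
    return problems


# Illustrative rows; the second and third entries are made up to trigger the checks.
rows = [
    {"SHORT_NAME": "F", "MEDIUM_NAME": "PHE", "SYNONYMS": "PHE;Phe; F"},
    {"SHORT_NAME": "f", "MEDIUM_NAME": "D|PHE", "SYNONYMS": ""},
    {"SHORT_NAME": "F", "MEDIUM_NAME": "4FF", "SYNONYMS": ""},
]
problems = check_names(rows)
for line in problems:
    print(line)
```

Here the second row is flagged for the pipe in its medium name, and the third row is flagged because it reuses the short name F, which would otherwise silently replace the first row's entry.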

Column field descriptions:

  • The SMILES column contains a SMILES string that encodes the chemical structure of the monomer. The column is required, but individual entries can be empty. SMILES strings can be generated for molecules using any chemically aware spreadsheet program (such as SARvision|SM).

  • Three name columns contain a short, medium and long name. These are used interchangeably inside SARvision|Biologics to optimize the look of tables and views. Typically, ‘SHORT_NAME’ is the smallest possible abbreviation, ideally 1-2 letters; ‘MEDIUM_NAME’ is a 3-4 letter abbreviation similar to those used in the PDB; and ‘LONG_NAME’ is the formal name of the residue. Examples would be short: F, medium: PHE and long: L-phenylalanine.

  • The SYNONYMS field contains any other names used for the same residue, separated by semicolons. It is used in sequence parsing to consolidate naming conventions used by disparate research groups. An example would be: PHE;Phe; F to designate a phenylalanine. This is a convenient way to normalize names without having to edit all the sequences that have already been generated.

  • CLUSTAL_SUBSTITION contains the closest natural residue (e.g. ‘F’ for ‘p-fluoro-phenylalanine’). It is substituted for the monomer when performing alignments with algorithms that use PAM and BLOSUM matrices.

  • CATEGORY allows an arbitrary designation of a monomer. It can be any text useful for categorizing residues, such as ‘aromatic’, ‘polar’, ‘lipophilic’ or ‘warhead’, or it can be null. Several views use these categories to group monomers.

  • FONT_COLOR is the color of the font for this residue in the program. It is usually black, with red, blue, green and so on used to designate unnatural or otherwise interesting residues. Note that RGB(##,##,##) can be used instead of the common color names.

  • The SORTORDER column is a real number that tells the program how to sort residues. These numbers are completely arbitrary and left to the user’s discretion. An example would be Phe: 10, D-Phe: 10.01, p-F-Phe: 10.02, m-F-Phe: 10.03, m-methyl-Phe: 10.04, N-methyl-Phe: 10.05, Tyr: 11, which when sorted would group the phenylalanine residues together and arrange them by substitution pattern on the phenyl ring.

  • The DATA:ColumnName columns are data columns. There can be an arbitrary number of them, each containing numeric data that describes the monomer and may be useful for analysis. Note that SARvision|Biologics automatically adds a number of RDKit-calculated properties by default. Early versions of SARvision used “<DATA>” instead of the “DATA:” designation in the column name. This is still recognized; however, naming with “DATA:” is database friendly and recommended.

  • The BKG:ColumnName columns are coloring columns that, when applied, color the background in the alignment table. There can be an arbitrary number of them. Colors are defined using RGB color coding, but simple names such as ‘blue’, ‘green’ and ‘red’ are also recognized. Note that SARvision|Biologics automatically adds a number of RDKit-calculated properties by default. Early versions of SARvision used “<BKG>” instead of the “BKG:” designator in the column name. This is still recognized; however, naming with “BKG:” is database friendly and recommended.
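
To make the roles of SYNONYMS and SORTORDER concrete, here is a small Python sketch of how a parser might use them. It is an illustration under stated assumptions, not SARvision's implementation: the helper names are hypothetical, the rows are plain dictionaries keyed by the column names described above, stray spaces around synonyms are assumed to be ignored, and the short and medium names given for D-Phe and p-F-Phe are made up for the example.

```python
# A few illustrative monomer rows using the columns described above.
# SORTORDER values follow the example in the text; the short and medium
# names for the non-natural residues are hypothetical.
MONOMERS = [
    {"SHORT_NAME": "F", "MEDIUM_NAME": "PHE", "LONG_NAME": "L-phenylalanine",
     "SYNONYMS": "PHE;Phe; F", "SORTORDER": 10.0},
    {"SHORT_NAME": "f", "MEDIUM_NAME": "DPHE", "LONG_NAME": "D-phenylalanine",
     "SYNONYMS": "D-Phe", "SORTORDER": 10.01},
    {"SHORT_NAME": "Fp", "MEDIUM_NAME": "PFF", "LONG_NAME": "p-fluoro-phenylalanine",
     "SYNONYMS": "p-F-Phe", "SORTORDER": 10.02},
    {"SHORT_NAME": "Y", "MEDIUM_NAME": "TYR", "LONG_NAME": "L-tyrosine",
     "SYNONYMS": "Tyr", "SORTORDER": 11.0},
]


def build_lookup(rows):
    """Map every short name, medium name and synonym to its monomer row."""
    lookup = {}
    for row in rows:
        names = [row["SHORT_NAME"], row["MEDIUM_NAME"]]
        names += row["SYNONYMS"].split(";")
        for name in names:
            name = name.strip()
            if name:
                # A later row overwrites an earlier one that used the same name.
                lookup[name] = row
    return lookup


def normalize(sequence, lookup):
    """Replace each residue name in a sequence with its MEDIUM_NAME."""
    return [lookup[name]["MEDIUM_NAME"] for name in sequence]


def sort_residues(names, lookup):
    """Sort residue names by the SORTORDER of the monomer they refer to."""
    return sorted(names, key=lambda name: lookup[name]["SORTORDER"])


lookup = build_lookup(MONOMERS)
print(normalize(["Phe", "F", "PHE", "Tyr"], lookup))
print(sort_residues(["Tyr", "p-F-Phe", "D-Phe", "Phe"], lookup))
```

With these rows, the three spellings of phenylalanine all resolve to PHE, and sorting places Phe, D-Phe and p-F-Phe together ahead of Tyr, which is the grouping the SORTORDER example describes.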

The molecular spreadsheet in SARvision|SM can be used to help create residue tables.

The Monomer table can reside in any of several places. SARvision comes with a default Monomer table stored locally, which can be extended manually using Excel or a molecular spreadsheet program. Alternatively, monomers can be stored in Oracle using a molecule registration system and retrieved as necessary by SARvision. One good solution is to use CDDVault to register and store monomers for retrieval on demand by SARvision. This is an excellent way to keep monomers up to date and consistent across multiple research groups.

Monomer tables can be stored as a file locally, in Oracle or in CDDVault.

In addition to the monomer table, SARvision supports the use of a Modifier table. It is not used often, but it makes it possible to annotate sequences with modifications that do not exist in the monomer table, such as isotope labeling or pegylation. Like the Monomer table, a Modifier table can be edited in Excel and saved as a file, or stored in a database system such as Oracle or CDDVault for on-demand retrieval by the program. An example of a Modifier table is shown below. Modifiers use many of the same fields as the monomer table and behave similarly.

A simple modifier table contains annotations to describe modified positions. Modifier tables are not used often.
