Sequence | SARvision

Formatting Sequences for SARvision

by Mark Hansen, Ph.D.

SARvision|Biologics reads in a number of sequence formats for analysis.

In biologics research, compounds are encoded as sequences of building blocks that can be easily analyzed for sequence-activity relationships. It is also recommended to store a chemical structure of the biological compound where possible (SMILES or SDF format), especially for studies involving macrocycles. Encoding both sequence and structure is straightforward and allows informatics studies to be performed in structure space or sequence space; both can be important.

SARvision reads a number of sequence formats, including FASTA, modified FASTA, aligned FASTA, SMILES, and HELM. Many companies devise their own sequence formats: if you do not see an applicable one below, please contact us. We can help either convert your sequences or implement new readers for you. Sequences can be read in from a file (Excel *.csv format), CDDVault, or Oracle, and usually exist as a single field (labeled “sequences”) in tabular input alongside columns of other data such as biological activities. Alternatively, as described below, sequences can be encoded across many columns, one monomer per cell, where each column corresponds to a single position in the sequence. We recommend that you store a structure (SMILES or SDF) and a sequence (FASTA or other) for each of your compounds. This keeps molecular entities well defined and makes it easy to move between sequence space and structure space when analyzing activity.

Formats:

1. FASTA format has long been used to store sequences in databases such as GenBank and ExPASy. It is simply a string of residues concatenated together. Sequences can be stored in two forms: as a FASTA file (top) or as a table with a column of FASTA strings (bottom).

>sp|P30874|SSR2_HUMAN Somatostatin receptor type 2 OS=Homo sapiens OX=9606 GN=SSTR2 PE=1 SV=1
MDMADEPLNGSHTWLSIPFDLNGSVVSTNTSNQTEPYYDLTSNAVLTFIYFVVCIIGLCG
NTLVIYVILRYAKMKTITNIYILNLAIADELFMLGLPFLAMQVALVHWPFGKAICRVVMT
VDGINQFTSIFCLTVMSIDRYLAVVHPIKSAKWRRPRTAKMITMAVWGVSLLVILPIMIY
AGLRSNQWGRSSCTINWPGESGAWYTGFIIYTFILGFLVPLTIICLCYLFIIIKVKSSGI
RVGSSKRKKSEKKVTRMVSIVVAVFIFCWLPFYIFNVSSVSMAISPTPALKGMFDFVVVL
TYANSCANPILYAFLSDNFKKSFQNVLCLVKVSGTDDGERSDSKQDKSRLNETTETQRTL
LNGDLQTSI

Example of FASTA format in an Excel (*.csv) file. Additional data columns can be added to the file for import.
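
To move between the two forms described in item 1, a FASTA file can be flattened into a table with one sequence string per row. The following is a minimal sketch rather than SARvision's own reader; the function names and the "name" column header are assumptions made for illustration, and only the first two sequence lines of the record above are used to keep the example short.

```python
import csv
import io


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between sequence rows
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_csv(text):
    """Return CSV text with one row per FASTA record."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "sequences"])  # column names are assumptions
    for header, seq in read_fasta(text):
        # Use the first whitespace-delimited word of the header as the name.
        writer.writerow([header.split()[0], seq])
    return out.getvalue()


example = """>sp|P30874|SSR2_HUMAN Somatostatin receptor type 2 OS=Homo sapiens OX=9606 GN=SSTR2 PE=1 SV=1
MDMADEPLNGSHTWLSIPFDLNGSVVSTNTSNQTEPYYDLTSNAVLTFIYFVVCIIGLCG

NTLVIYVILRYAKMKTITNIYILNLAIADELFMLGLPFLAMQVALVHWPFGKAICRVVMT
"""

records = list(read_fasta(example))
csv_text = fasta_to_csv(example)
print(csv_text)
```

Further columns, such as biological activities, could be appended to each row before the file is imported.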

2. Modified FASTA is nearly identical to FASTA, with the addition of chain breaks (‘|’), chain cross-links (numbered sequentially), and multi-character monomers placed inside brackets (‘[Pal]’). These simple additions allow the inclusion of additional chemistries (>600 building blocks in some companies), multiple chains, and macrocycles. Inside the brackets, a modifier from the optional Modifier table can be included using a ‘-’: e.g. [N15-A] would be N15-labeled adenine. An example of this format is shown below.

Modified FASTA format, where a modifier (optional and not often used) is shown in red, cross-links between and within chains are numbered and colored blue, and brackets enclose multi-letter monomers. Note that single-letter monomers (‘L’, ‘P’ and ‘T’) may be written with or without brackets. Including these few features greatly expands the chemistry a FASTA format can express while maintaining its simplicity.
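
The rules in item 2 can be illustrated with a small tokenizer. This is a sketch under stated assumptions, not SARvision's parser: it assumes a cross-link number is written directly after the monomer it belongs to, that consecutive digits form one label, and that the first ‘-’ inside brackets separates a modifier from the monomer (so a monomer name that itself contains a hyphen would be misread). The `Monomer` class, the function name, and the example string are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# One token: [bracketed monomer], single letter, cross-link number, or chain break.
TOKEN = re.compile(r"\[([^\[\]]+)\]|([A-Za-z])|(\d+)|(\|)")


@dataclass
class Monomer:
    symbol: str
    modifier: Optional[str] = None
    links: List[int] = field(default_factory=list)


def parse_modified_fasta(seq):
    """Return a list of chains, each a list of Monomer objects."""
    chains = [[]]
    pos = 0
    while pos < len(seq):
        m = TOKEN.match(seq, pos)
        if m is None:
            raise ValueError(f"unexpected character {seq[pos]!r} at position {pos}")
        bracket, letter, digits, bar = m.groups()
        if bar:
            chains.append([])  # '|' starts a new chain
        elif digits:
            if not chains[-1]:
                raise ValueError("cross-link number with no preceding monomer")
            chains[-1][-1].links.append(int(digits))
        elif letter:
            chains[-1].append(Monomer(letter))
        else:
            modifier, sep, symbol = bracket.partition("-")
            if sep and symbol:
                chains[-1].append(Monomer(symbol, modifier))  # e.g. [N15-A]
            else:
                chains[-1].append(Monomer(bracket))  # e.g. [Pal] or [L]
        pos = m.end()
    return chains


# Hypothetical two-chain example with one cross-link inside the first chain.
chains = parse_modified_fasta("[N15-A]C1[Pal]TC1|G[L]P")
for chain in chains:
    print([(m.modifier, m.symbol, m.links) for m in chain])
```

Monomers that share a cross-link number can then be paired up to recover the macrocycle or inter-chain bridge.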

3. Aligned FASTA format is similar to the above, except that dashes are added to encode gaps. Including gaps in the sequence lets the format retain an alignment that was either calculated elsewhere or determined by the synthetic patterns used to create the peptide library. An example is shown below.

The addition of [-][-] places two gaps in the macrocycle formed between T and the C-terminus. These gaps maintain a predefined alignment.
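
A quick way to see what the gaps buy you is to split each aligned sequence into alignment columns and confirm that every member of a library has the same width. The sketch below is illustrative only: the three library strings are hypothetical, gaps are accepted either as a bare ‘-’ or as ‘[-]’, and cross-link numbers and chain breaks are simply ignored.

```python
import re

# A column is a bracketed token, a single letter, or a bare dash.
TOKEN = re.compile(r"\[[^\[\]]+\]|[A-Za-z-]")


def columns(aligned):
    """Split an aligned sequence into one symbol per alignment column."""
    return [t[1:-1] if t.startswith("[") else t for t in TOKEN.findall(aligned)]


def ungapped(aligned):
    """Return the monomers of an aligned sequence with gaps removed."""
    return [t for t in columns(aligned) if t != "-"]


# Hypothetical library: the gaps keep position 6 (G) and 7 (K) lined up.
library = ["ACT[-][-]GK", "ACTLPGK", "ACT[-]PGK"]
widths = [len(columns(s)) for s in library]
print(widths)
print(ungapped(library[0]))
```

If the widths differ, the sequences are not on a common alignment and the gap placement should be revisited before import.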

4. Peptides are often encoded as SMILES at many companies. We recommend that users store both a structure and a sequence for their biologics: this can be a SMILES and a FASTA. SARvision|Biologics can parse simple peptide SMILES, but it has limitations if the peptides deviate from amide backbones, form complex cycles, or have multiple chains. For simple cases it works fine. When a SMILES is read into the program, a sequence is created and loaded for reference.

5. Sequences can be stored pre-aligned in multiple columns. On import, select the multiple-column option and the pre-aligned sequences flag, as shown below.

Sequences_6.PNG
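
For readers who hold their sequences as single strings, the multi-column layout of item 5 can be produced by writing one monomer per cell. This is a sketch, not a SARvision utility: the compound names, the "P1, P2, ..." column headers, and the choice to right-pad shorter sequences with ‘-’ are all assumptions.

```python
import csv
import io
import re

# A cell is a bracketed monomer, a single letter, or a bare dash (gap).
TOKEN = re.compile(r"\[[^\[\]]+\]|[A-Za-z-]")


def to_columns(rows):
    """rows: list of (name, aligned_sequence). Return CSV text, one monomer per cell."""
    split = [(name, [t.strip("[]") for t in TOKEN.findall(seq)]) for name, seq in rows]
    width = max(len(tokens) for _, tokens in split)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name"] + [f"P{i}" for i in range(1, width + 1)])
    for name, tokens in split:
        # Assumption: pad short rows with gaps so every row has the same width.
        writer.writerow([name] + tokens + ["-"] * (width - len(tokens)))
    return out.getvalue()


rows = [("cmpd-1", "AC[Pal]-K"), ("cmpd-2", "ACLPK")]
table = to_columns(rows)
print(table)
```

Each column then corresponds to a single position in the sequence, which is the layout the multiple-column import expects.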

6. SARvision reads HELM format as well. The monomer tables are similar, and your IT group should be able to interconvert as necessary. HELM strings take the following form.

Sequences_5.PNG
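
As a rough illustration of interconversion, a single linear chain in modified FASTA can be rewritten as a simple HELM string, where monomers are separated by ‘.’ inside a named polymer and multi-character monomer IDs are bracketed. The sketch below makes several assumptions: the function name and example are hypothetical, monomer symbols are assumed to match the IDs in your HELM monomer library, and cross-links, chain breaks, gaps, and modifiers are not handled.

```python
import re

# A monomer is either a bracketed token or a single letter; other characters are skipped.
TOKEN = re.compile(r"\[[^\[\]]+\]|[A-Za-z]")


def to_helm(seq, polymer_id="PEPTIDE1"):
    """Convert one linear modified-FASTA chain into a simple HELM string."""
    monomers = []
    for tok in TOKEN.findall(seq):
        symbol = tok.strip("[]")
        # HELM brackets multi-character monomer IDs; single letters stay bare.
        monomers.append(symbol if len(symbol) == 1 else "[" + symbol + "]")
    # The four '$' delimit the (empty) connection, pairing, and annotation sections.
    return polymer_id + "{" + ".".join(monomers) + "}$$$$"


helm = to_helm("AC[Pal]LK")
print(helm)
```

Cyclic and multi-chain compounds additionally need entries in the connection section, which depend on the attachment points defined in the monomer table.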