2. proBAM

proBAM description

One of the main goals of proteomics is to identify and quantify proteins in complex biological samples. This is achieved using mass spectrometry (MS) as the major analytical tool and so-called sequence search engines as the bioinformatics interpretation tool. Sequence search engines match experimental MS spectra against fragmentation spectra generated in silico from the peptide sequences of a protein or nucleotide sequence database. There is great interest in exploring the identified peptides in the context of the genome of the species from which the sample was derived. Furthermore, researchers want to integrate proteomics results smoothly with the corresponding transcriptomics data. An important goal of the proBAM format is therefore to achieve greater interoperability between proteomics and genomics software tools.

The protein BAM (proBAM) file format is designed for storing and analysing peptide-spectrum matches (PSMs) within the context of the genome. proBAM is built upon the SAM format and its compressed binary version, BAM, with the modifications necessary to accommodate information specific to proteomics data, such as PSM scores and confidence, charge states, and peptide-level modifications, both artificial and PTMs (post-translational modifications).


The latest versions of the specification files, examples, and additional information are available from the HUPO PSI website (http://www.psidev.info/proBAM).


specifications

The proBAM format is adapted from the SAM format and is a Tab-delimited text format containing a header and an alignment section. The lines of the header section start with ‘@’, in contrast to the alignment lines.

proBAM header

In the header, each line is Tab-delimited. Except for the @CO lines, each data field follows the format ‘TAG:VALUE’, where TAG is a two-letter string that defines the content and the format of VALUE. As described in Table 1, the record types @HD, @SQ, @PG and @CO are inherited from the SAM format. The record type @GA, however, is unique to proBAM and describes the genome annotation reference system used to generate the proBAM file, e.g. GENCODE v19, ENSEMBL 77, RefSeq 71. If a RefSeq annotation (downloaded from the UCSC table browser) is used, the date of the download should be specified. Finally, the software used to generate the proBAM file should be specified in the header with an @PG line. A table describing the header tags and an example proBAM header are provided below.



proBAM header section

Record  Tag  Description
@HD     VN*  SAM/BAM format version. Accepted format: /^[0-9]+\.[0-9]+$/
@HD     SO   Sorting order of alignments. Valid values: unknown (default), unsorted, queryname and coordinate. For coordinate sort, the major sort key is the RNAME field, with the order defined by the order of the @SQ lines in the header. The minor sort key is the POS field. For alignments with equal RNAME and POS, the order is arbitrary. All alignments with ‘*’ in the RNAME field follow alignments with some other value but are otherwise in arbitrary order.
@SQ     SN   Reference sequence name. Each @SQ line must have a unique SN tag. The value of this field is used in the RNAME field of the alignment records. Regular expression: [!-)+-<>-~][!-~]*
@SQ     LN   Reference sequence length. Range: [1, 2^31-1]
@SQ     AS   Genome assembly identifier
@SQ     SP   Species
@PG     ID   Program identifier: the program used to generate the proBAM file
@PG     VN   Program version
@PG     CL   Command line
@GA     AS   Annotation source. Valid values: GENCODE, ENSEMBL, REFSEQ
@GA     VN   Annotation version, e.g. 19 (GENCODE release), 81 (Ensembl release), 71 (RefSeq release)
@GA     AD   Annotation description
@CO          One-line text comment (unordered; multiple @CO lines are allowed)
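
The value constraints in the table above can be checked mechanically. The following Python sketch is illustrative only and is not part of proBAMconvert: the function names are hypothetical, and the upper bound for LN is assumed to be 2^31-1, as in the SAM specification.

```python
import re

# Patterns taken from the header table; the LN bound is an assumption (SAM: 2^31-1).
VN_PATTERN = re.compile(r"[0-9]+\.[0-9]+")
SN_PATTERN = re.compile(r"[!-)+-<>-~][!-~]*")
MAX_LN = 2**31 - 1


def is_valid_version(vn):
    """True if an @HD VN value looks like '1.0'."""
    return VN_PATTERN.fullmatch(vn) is not None


def is_valid_sequence_name(sn):
    """True if an @SQ SN value matches the reference-name pattern."""
    return SN_PATTERN.fullmatch(sn) is not None


def is_valid_sequence_length(ln):
    """True if an @SQ LN value is an integer in [1, 2^31-1]."""
    if re.fullmatch(r"[0-9]+", ln) is None:
        return False
    return 1 <= int(ln) <= MAX_LN


print(is_valid_version("1.0"))
print(is_valid_sequence_name("chr1"))
print(is_valid_sequence_length("249250621"))
```

The SN pattern rejects names that start with ‘*’ or ‘=’, which keeps reference names distinguishable from the placeholder values used in the alignment columns.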

proBAM header example


@HD      VN:1.0 SO:coordinate
@SQ      SN:chr1      LN:249250621
@SQ      SN:chr10      LN:135534747
@SQ      SN:chr11      LN:135006516
@SQ      SN:chr12      LN:133851895
@SQ      SN:chr13      LN:115169878
@SQ      SN:chr14      LN:107349540
@SQ      SN:chr15      LN:102531392
@SQ      SN:chr16      LN:90354753
@SQ      SN:chr17      LN:81195210
@SQ      SN:chr18      LN:78077248
@SQ      SN:chr19      LN:59128983
@SQ      SN:chr2      LN:243199373
@SQ      SN:chr20      LN:63025520
@SQ      SN:chr21      LN:48129895
@SQ      SN:chr22      LN:51304566
@SQ      SN:chr3      LN:198022430
@SQ      SN:chr4      LN:191154276
@SQ      SN:chr5      LN:180915260
@SQ      SN:chr6      LN:171115067
@SQ      SN:chr7      LN:159138663
@SQ      SN:chr8      LN:146364022
@SQ      SN:chr9      LN:141213431
@SQ      SN:chrM      LN:16571
@SQ      SN:chrX      LN:155270560
@SQ      SN:chrY      LN:59373566
@PG      ID:proBAMr VN:1.0 CL:
@GA      AS:GENCODE VN:19
@CO      proBAM file example
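
As a rough illustration of how such a header can be read, the Python sketch below groups Tab-delimited header lines by record type and splits each ‘TAG:VALUE’ field. The function name `parse_header` is hypothetical, and the input is a shortened version of the example above; proBAMconvert itself may handle headers differently.

```python
from collections import defaultdict


def parse_header(lines):
    """Group header lines by record type; each record becomes a TAG -> VALUE dict."""
    records = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("@"):
            break  # the header ends at the first alignment line
        record_type, _, rest = line.partition("\t")
        if record_type == "@CO":
            records["@CO"].append(rest)  # comments are free text, not TAG:VALUE
            continue
        fields = {}
        for field in rest.split("\t"):
            if not field:
                continue
            tag, _, value = field.partition(":")  # split on the first ':' only
            fields[tag] = value
        records[record_type].append(fields)
    return dict(records)


example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:249250621",
    "@PG\tID:proBAMr\tVN:1.0",
    "@GA\tAS:GENCODE\tVN:19",
    "@CO\tproBAM file example",
]
header = parse_header(example)
print(header["@GA"][0]["AS"], header["@GA"][0]["VN"])
```

Splitting on the first ‘:’ only keeps values that themselves contain a colon intact.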


proBAM alignment section

Each alignment line has 11 mandatory fields (Table 2), adopted from the SAM format, for essential alignment information such as mapping position, and additional mandatory fields to accommodate unique features of proteomics data (Table 3). For instance, we introduce the ‘XM’ field to store peptide modification information (both artefactual and PTMs), the ‘XS’ field to store PSM scores, and the ‘XC’ field to store the peptide charge state. These additional fields follow the ‘TAG:TYPE:VALUE’ format, similar to the optional fields in a BAM file. Tables describing the alignment section fields are provided below.

standard columns adapted from SAM/BAM

no.  name   type    description                          default value
1    QNAME  string  spectrum name                        *
2    FLAG   int     bitwise FLAG (see further)           *
3    RNAME  string  reference sequence name              *
4    POS    int     1-based leftmost mapping position    0
5    MAPQ   int     unused in proBAM                     255
6    CIGAR  string  extended CIGAR string (see further)  *
7    RNEXT  string  unused in proBAM                     *
8    PNEXT  int     unused in proBAM                     0
9    TLEN   int     unused in proBAM                     0
10   SEQ    string  coding sequence                      *
11   QUAL   string  unused in proBAM                     *
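
To make the column layout concrete, here is a minimal Python sketch that splits one Tab-delimited alignment line into the 11 standard columns and leaves any further columns untouched. The names `MandatoryFields` and `split_alignment_line` and the example line are invented for illustration and do not come from a real proBAM file.

```python
from typing import NamedTuple


class MandatoryFields(NamedTuple):
    qname: str  # spectrum name
    flag: int   # bitwise FLAG
    rname: str  # reference sequence name
    pos: int    # 1-based leftmost mapping position
    mapq: int   # unused in proBAM
    cigar: str  # extended CIGAR string
    rnext: str  # unused in proBAM
    pnext: int  # unused in proBAM
    tlen: int   # unused in proBAM
    seq: str    # coding sequence
    qual: str   # unused in proBAM


def split_alignment_line(line):
    """Return the 11 standard columns and the list of remaining raw columns."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 11:
        raise ValueError("expected at least 11 Tab-delimited columns")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = columns[:11]
    mandatory = MandatoryFields(
        qname, int(flag), rname, int(pos), int(mapq),
        cigar, rnext, int(pnext), int(tlen), seq, qual,
    )
    return mandatory, columns[11:]


# Hypothetical line: a 9-residue peptide (27 nucleotides) on the reverse strand.
line = ("spectrum_42\t16\tchr1\t12345\t255\t27M\t*\t0\t0\t"
        "ATGGCTAAGCTGGAGCGTCTGAAGTCT\t*\tXC:i:2\tXS:f:35.2")
mandatory, extra = split_alignment_line(line)
print(mandatory.rname, mandatory.pos, mandatory.cigar, extra)
```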

additional (proBAM) columns

tag  type    description
NH   int     number of genomic locations to which the peptide sequence maps
XA   float   mass error (experimental - calculated)
XB   int     number of peptides to which the spectrum maps
XC   int     peptide charge
XE   int     enzyme used (see the proBAM specification file for the full list of enzymes and their corresponding integer codes)
XF   int     reading frame of the peptide
XG   string  peptide type:
             N: normal peptide;
             V: variant peptide;
             W: indel peptide;
             J: novel junction peptide;
             A: alternative junction peptide;
             M: novel exon peptide;
             C: cross junction peptide;
             E: extension peptide;
             B: 3’UTR peptide;
             O: out-of-frame peptide;
             T: truncation peptide;
             R: reverse strand peptide;
             I: intron peptide;
             G: gene fusion peptide;
             D: decoy peptide;
             U: unmapped;
             X: unknown
XI   float   peptide intensity
XL   int     number of peptides to which the spectrum maps
XM   string  modification(s): semicolon-separated list of modifications with the following format (adopted from the mzTab format): {position}-[{modification identifier}|{neutral loss}]. The position gives the peptide position starting from 1. An N-terminal modification is specified at position 0. Valid modification identifiers are either PSI-MOD or UNIMOD accessions (including the “MOD:” / “UNIMOD:” prefix). In the case of an “unknown” modification not included in either UNIMOD or PSI-MOD.
XN   int     number of missed cleavages
XO   string  uniqueness of the peptide mapping:
             1: unique;
             2: not_unique[super-set];
             3: not_unique[same-set];
             4: not_unique[subset];
             5: not_unique[conflict];
             0: not_unique[unknown]
XP   string  peptide sequence from the original search result (may differ from field XR due to variations)
XQ   float   PSM q-value
XR   string  reference peptide sequence; the reference genomic sequence is used to derive the reference peptide sequence if no reference protein sequence is available
XS   float   PSM score
XT   int     enzyme specificity considered in the search:
             0: non-enzymatic;
             1: semi-enzymatic;
             2: fully-enzymatic;
             3: unknown
XU   string  a URI (Uniform Resource Identifier) pointing to the file's source data, e.g. the website or FTP location of a given dataset (such as a dataset in PRIDE Archive, http://www.ebi.ac.uk/pride/archive/projects/PXD000764) or the original file that was converted to proBAM. Another possibility is to add the URI of the specific PSM reported in a given proBAM line.
YA   string  following amino acids (2 AA, A stands for after)
YB   string  preceding amino acids (2 AA, B stands for before)
YP   string  protein accession ID from the original search result
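
The additional columns can be read along the following lines. This Python sketch rests on several assumptions that should be checked against the proBAM specification file: that the type codes are the SAM ones (‘i’ for int, ‘f’ for float, ‘Z’ for string), that an XM value of ‘*’ means no modification, and that each XM entry is written as position-identifier. The function names and example values are hypothetical.

```python
def parse_optional_field(field):
    """Split 'TAG:TYPE:VALUE' and convert VALUE according to TYPE."""
    tag, type_code, value = field.split(":", 2)  # VALUE may itself contain ':'
    if type_code == "i":
        return tag, int(value)
    if type_code == "f":
        return tag, float(value)
    return tag, value  # 'Z' and any other type are kept as text


def parse_modifications(xm):
    """Turn an XM value into a list of (position, identifier) pairs.

    Position 0 denotes an N-terminal modification. A neutral loss, if present
    after '|', is left inside the identifier string in this sketch.
    """
    if xm in ("", "*"):  # assumption: '*' marks a peptide without modifications
        return []
    modifications = []
    for entry in xm.split(";"):
        if not entry:
            continue
        position, _, identifier = entry.partition("-")
        modifications.append((int(position), identifier))
    return modifications


print(parse_optional_field("XC:i:2"))
print(parse_optional_field("XS:f:35.2"))
print(parse_modifications("0-UNIMOD:1;5-UNIMOD:35"))
```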

FLAG value

In the proBAM format, PSMs replace short sequence reads as the basic data unit. proBAM allows for 5 FLAG values to describe peptide mapping information (table below). Note that in the original SAM/BAM format 0x400 indicates a PCR or optical duplicate, whereas in proBAM it represents decoy peptides.



bit    description                                         FLAG
0x00   peptide maps to the forward strand                  0
0x10   peptide maps to the reverse strand                  16
0x100  peptide is not the rank=1 peptide for the spectrum  256
0x4    unmapped peptide                                    4
0x400  decoy peptide                                       1024
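
A FLAG value can be decoded with bitwise tests, as in the Python sketch below. It assumes that the bits combine as they do in SAM (for example 16 + 256 = 272) and includes 0x400 for decoy peptides as stated in the text above; the function name `describe_flag` is hypothetical.

```python
# Bit meanings as listed for proBAM; 0x400 (decoy) is taken from the text above.
FLAG_BITS = {
    0x4: "unmapped peptide",
    0x10: "peptide maps to the reverse strand",
    0x100: "peptide is not the rank=1 peptide for the spectrum",
    0x400: "decoy peptide",
}
FORWARD = "peptide maps to the forward strand"


def describe_flag(flag):
    """Return the descriptions of all bits set in a proBAM FLAG value."""
    descriptions = [text for bit, text in FLAG_BITS.items() if flag & bit]
    # The forward strand is implied when neither 0x10 nor 0x4 is set.
    if not flag & 0x10 and not flag & 0x4:
        descriptions.insert(0, FORWARD)
    return descriptions


for value in (0, 16, 272, 4):
    print(value, describe_flag(value))
```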

CIGAR STRING

The CIGAR string, as used by the SAM/BAM format, describes how a peptide aligns to the genome; the operations are analogous to those used for short-read sequence alignments. The table below summarizes the CIGAR operations.



operation description
M alignment match (including a sequence match and a mismatch)
I insertion to the reference
D deletion from the reference
N skipped region from the reference (represents an intron)
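
The following Python sketch shows one way to turn POS and a CIGAR string into the genomic blocks covered by a peptide, splitting at each N (intron) operation. It handles only the four operations listed above, and the function names and example coordinates are made up for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDN])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":
        return []
    operations = CIGAR_OP.findall(cigar)
    # Reject strings that contain anything other than the four operations.
    if "".join(length + op for length, op in operations) != cigar:
        raise ValueError("unsupported CIGAR string: " + cigar)
    return [(int(length), op) for length, op in operations]


def aligned_blocks(pos, cigar):
    """Return 1-based inclusive (start, end) genomic blocks for a peptide."""
    blocks = []
    start = cursor = pos
    for length, op in parse_cigar(cigar):
        if op in "MD":
            cursor += length  # match and deletion both consume the reference
        elif op == "N":
            if cursor > start:
                blocks.append((start, cursor - 1))
            cursor += length  # skip the intron
            start = cursor
        # 'I' consumes only the peptide coding sequence, not the reference
    if cursor > start:
        blocks.append((start, cursor - 1))
    return blocks


# A 9-residue peptide (27 nucleotides) spanning a 50-nucleotide intron.
print(aligned_blocks(100, "10M50N17M"))
```

For a peptide that lies within a single exon, the result is one block whose length is three times the peptide length.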


The proBAMconvert tool converts peptide identification files (mzIdentML, pepXML and mzTab) to proBAM by mapping the identified peptides onto a genome. The following chapters give an overview to familiarize users with the different aspects of proBAMconvert.


Continue to the next chapter, "installation", where installation instructions for proBAMconvert are provided.

