Matrix Science

Sequence Database Setup: MSDB

Overview

The MSDB database is compiled by the Proteomics Group at Imperial College London. MSDB is a composite, non-identical protein sequence database built from a number of primary source databases. Where the same sequence occurs in more than one source, the entry from the higher priority database is retained. The present source databases (in order of priority) are: PIR, TrEMBL, GenBank, Swiss-Prot, and NRL3D.

One of the main advantages of MSDB is the accompanying full-text reference file.

Download

ftp://ftp.ebi.ac.uk/pub/databases/MassSpecDB/
ftp://ftp.ncbi.nih.gov/repository/MSDB/
ftp://csc-fserve.hh.med.ic.ac.uk/pub/

You should download three files: the Fasta database (msdb.fasta.Z), the release notes (msdb.nam.Z), and a reference file. There is a choice of reference files: a complete reference file that includes the Swiss-Prot annotation text (msdb.ref.complete.Z), and a file in which the Swiss-Prot annotations have been replaced by links to the ExPASy web site (msdb.ref.Z). The second file dates from when a licence was required for commercial use of Swiss-Prot.

To download updates automatically, use either the MSDB_from_EBI or the MSDB_from_NCBI definition block in db_update.pl.
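
For a one-off manual download, the three files can also be fetched with a short script. The following Python sketch is not part of Mascot or db_update.pl. It assumes that the mirror is still reachable and that the files sit directly in the mirror directory listed above; the function names msdb_urls and download_msdb are hypothetical.

```python
import os
import urllib.request

# Mirror taken from the list above. Assumption: the files sit at its top level.
EBI_MIRROR = "ftp://ftp.ebi.ac.uk/pub/databases/MassSpecDB/"


def msdb_urls(base=EBI_MIRROR, complete_ref=False):
    """Return the URLs of the three files to download."""
    # Choose between the complete reference file and the one with links.
    ref = "msdb.ref.complete.Z" if complete_ref else "msdb.ref.Z"
    names = ["msdb.fasta.Z", "msdb.nam.Z", ref]
    return [base.rstrip("/") + "/" + name for name in names]


def download_msdb(dest_dir, base=EBI_MIRROR, complete_ref=False):
    """Fetch the three files into dest_dir and return the local paths."""
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for url in msdb_urls(base, complete_ref):
        local = os.path.join(dest_dir, url.rsplit("/", 1)[1])
        urllib.request.urlretrieve(url, local)  # urllib handles ftp:// URLs
        paths.append(local)
    return paths
```

Calling download_msdb with a scratch folder fetches the files; they still have to be decompressed and renamed as described under Configuration.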

Taxonomy

Taxonomy for MSDB is predefined in mascot.dat; choose "MSDB REF". The following taxonomy files are required:

ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

Note that the taxonomy files go into the taxonomy directory, not into the sequence database directory. Also, some files need to be unpacked (using tar) as well as uncompressed.
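
Where tar is not available, the unpacking step can be done in one pass with Python's standard tarfile module. This is a minimal sketch rather than a Mascot utility: the function name unpack_taxdump is hypothetical, and it assumes that the archive holds its files at the top level and that you pass in the path to your own taxonomy directory.

```python
import os
import tarfile


def unpack_taxdump(archive_path, taxonomy_dir):
    """Uncompress and unpack taxdump.tar.gz into the taxonomy directory."""
    os.makedirs(taxonomy_dir, exist_ok=True)
    # "r:gz" uncompresses (gzip) and unpacks (tar) in a single step.
    with tarfile.open(archive_path, "r:gz") as archive:
        # Keep only regular files stored at the top level of the archive,
        # so that nothing can be written outside taxonomy_dir.
        members = [
            m for m in archive.getmembers()
            if m.isfile() and m.name == os.path.basename(m.name)
        ]
        archive.extractall(taxonomy_dir, members=members)
    return sorted(m.name for m in members)
```

The return value is the sorted list of file names that were unpacked, which is a quick way to confirm that the files landed in the taxonomy directory and not in the sequence database directory.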

Parse Rules

A typical Fasta title line is:

>B32382 fcbH bifunctional protein - Bradyrhizobium japonicum

The single identifier varies according to the source database. Suitable parse rules are:

Accession from Fasta title: ">\([^ ]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"

The corresponding line in the Ref file is:

>P1;B32382

Accession from Ref file: ">[A-Z][0-9];\([^ ]*\)"
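
To check what the three parse rules above extract, they can be tried against the example lines outside Mascot. The sketch below translates them to Python regular expressions: the escaped \( \) groups of the rules become plain ( ) groups. The helper name first_group is hypothetical, and Python's regex engine is only a stand-in for the one Mascot uses.

```python
import re

# Python translations of the parse rules; only the group syntax changes.
ACCESSION_FROM_FASTA = re.compile(r">([^ ]*)")
DESCRIPTION_FROM_FASTA = re.compile(r">[^ ]* (.*)")
ACCESSION_FROM_REF = re.compile(r">[A-Z][0-9];([^ ]*)")


def first_group(rule, line):
    """Return the text captured by the rule, or None if it does not match."""
    match = rule.match(line)
    return match.group(1) if match else None


fasta_title = ">B32382 fcbH bifunctional protein - Bradyrhizobium japonicum"
ref_line = ">P1;B32382"

print(first_group(ACCESSION_FROM_FASTA, fasta_title))    # B32382
print(first_group(DESCRIPTION_FROM_FASTA, fasta_title))  # fcbH bifunctional protein - Bradyrhizobium japonicum
print(first_group(ACCESSION_FROM_REF, ref_line))         # B32382
```

The accession extracted from the Fasta title and the one extracted from the Ref file are the same string, which is what allows an entry in one file to be looked up in the other.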

Configuration

For this example, the three files were downloaded to a folder named C:\Inetpub\MASCOT\sequence\MSDB\current. The files were decompressed using gzip and renamed to MSDB_20020515.fasta, MSDB_20020515.ref, and MSDB_20020515.nam.

When updating an active database, it is important to rename the Fasta file last, because Mascot will begin database exchange as soon as it sees a new Fasta file that matches the wildcard path for the database.
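
The rename order can be enforced in a script so that the Fasta file is never renamed first by accident. The following Python sketch is illustrative only: the function name install_release is hypothetical, and it assumes that the decompressed files are still called msdb.ref, msdb.nam, and msdb.fasta.

```python
import os


def install_release(folder, stamp, prefix="MSDB"):
    """Rename msdb.* to <prefix>_<stamp>.*, handling the Fasta file last."""
    renamed = []
    # The Fasta file is deliberately last: Mascot begins database exchange
    # as soon as a new Fasta file matches the wildcard path.
    for ext in ("ref", "nam", "fasta"):
        src = os.path.join(folder, "msdb." + ext)
        dst = os.path.join(folder, f"{prefix}_{stamp}.{ext}")
        os.rename(src, dst)
        renamed.append(os.path.basename(dst))
    return renamed
```

For the example above, install_release(r"C:\Inetpub\MASCOT\sequence\MSDB\current", "20020515") would rename the reference and release notes files first and MSDB_20020515.fasta last.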

[Screenshot: Mascot database maintenance utility]

If you don't require full text in a Mascot Protein View report, simply leave the Host, Port, and Path fields blank and choose "--- no full text report ---" in the drop-down list.

Always test a new definition before applying the changes to mascot.dat.

 
 
Copyright © 2007 Matrix Science Ltd. All Rights Reserved.