MyGenBank Documentation
Author: Ian Korf
Last modified: 2001-07-06
Introduction
Database Specification
Setting up MyGenBank
Querying MyGenBank
Introduction
MyGenBank is a simple package for managing a local copy of GenBank in MySQL.
Anyone interested in MyGenBank should first read the most recent GenBank release notes and
perhaps also see the DDBJ/EMBL/GenBank Feature Table
definition and the taxonomy definitions
(see the names.dmp file). MyGenBank consists of two main components:
- the administration tool mygb_admin
- the querying tools mygb_fetch and mygb_query
Database Specification
MyGenBank exists as a single table containing some of the most important
sequence attributes. However, the sequence is not stored in the
database. To get the raw sequence, Fasta file, or GenBank flat file, you
use the mygb_fetch tool (see below).
| Column | Type | Attributes | Indexed |
| --- | --- | --- | --- |
| accession | VARCHAR(8) | NOT NULL, PRIMARY KEY | yes |
| version | INT1 | NOT NULL | no |
| gi | INT4 | NOT NULL, UNIQUE | yes |
| length | INT3 | NOT NULL | yes |
| date | DATE | NOT NULL | yes |
| taxid | INT3 | NOT NULL | yes |
| mol_type | ENUM | NOT NULL | yes |
| division | ENUM | NOT NULL | yes |
| keywords | SET | | no |
| features | SET | | no |
| file | VARCHAR(6) | NOT NULL | no |
| fasta | INT4 | NOT NULL | no |
| genbank | INT4 | NOT NULL | no |
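For orientation, the column list above corresponds roughly to a MySQL table definition like the one generated by the sketch below. This is not the contents of MyGenBank.sql: the helper names `quote_list` and `create_table_sql`, the choice of plain secondary indexes, and the short value lists in the usage line are assumptions made for illustration. INT1, INT3, and INT4 are MySQL synonyms for TINYINT, MEDIUMINT, and INT.

```python
# Sketch only: builds a CREATE TABLE statement that mirrors the column list.
# The real definition lives in $MYGENBANK_DATA/Definition/MyGenBank.sql.

def quote_list(values):
    """Render a Python list as a quoted, comma-separated SQL value list."""
    return ", ".join("'%s'" % v for v in values)

def create_table_sql(mol_types, divisions, keywords, features):
    """Return a CREATE TABLE statement for a MyGenBank-like table."""
    columns = [
        "accession VARCHAR(8) NOT NULL PRIMARY KEY",
        "version INT1 NOT NULL",
        "gi INT4 NOT NULL UNIQUE",
        "length INT3 NOT NULL",
        "date DATE NOT NULL",
        "taxid INT3 NOT NULL",
        "mol_type ENUM(%s) NOT NULL" % quote_list(mol_types),
        "division ENUM(%s) NOT NULL" % quote_list(divisions),
        "keywords SET(%s)" % quote_list(keywords),
        "features SET(%s)" % quote_list(features),
        "file VARCHAR(6) NOT NULL",
        "fasta INT4 NOT NULL",
        "genbank INT4 NOT NULL",
    ]
    # Secondary indexes for the remaining columns marked "yes"; accession
    # and gi are already indexed through PRIMARY KEY and UNIQUE.
    indexed = ["length", "date", "taxid", "mol_type", "division"]
    columns += ["INDEX (%s)" % name for name in indexed]
    return "CREATE TABLE MyGenBank (\n  " + ",\n  ".join(columns) + "\n)"

# Usage with deliberately shortened value lists.
print(create_table_sql(["mRNA", "ss-DNA"], ["BCT", "EST"],
                       ["EST", "HTG"], ["CDS", "exon"]))
```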
Enums and Sets
The division enums are determined during setup. They are stored in the
$MYGENBANK_DATA/Definition directory. The keywords, features, and mol_types are
stored in the $MYGENBANK_CODE/def directory and terse copies are made during
setup and saved to the $MYGENBANK_DATA/Definition directory. You may edit the
keywords and features files to fit your own criteria; see the directions in
each of the files in $MYGENBANK_CODE/def. The default keywords, features,
mol_types, and divisions are given below.
- mol_type - note that I include the ds- or ss- prefix in the mol_type
rather than storing a separate value. Also, the CIRCULAR tag is omitted.
-
AA
ds-RNA
ds-mRNA
ds-rRNA
mRNA
ms-DNA
ms-RNA
rRNA
scRNA
ss-DNA
ss-RNA
tRNA
uRNA
- division
-
BCT
EST
GSS
HTG
INV
MAM
PAT
PHG
PLN
PRI
ROD
STS
SYN
UNA
VRL
VRT
- keywords - this is my chosen set of keywords (from the KEYWORDS
field).
-
EST
HTG
HTGS_PHASE0
HTGS_PHASE1
HTGS_PHASE2
HTGS_PHASE3
HTGS_DRAFT
GSS
STS
- features - this is my set of features, not the entire GenBank set.
There is a maximum of 63 features, which is fewer than are present in
GenBank.
-
3_UTR
3_clip
5_UTR
5_clip
CAAT_signal
CDS
GC_signal
RBS
STS
TATA_signal
conflict
enhancer
exon
gene
intron
mRNA
mat_peptide
misc_RNA
misc_binding
misc_signal
misc_structure
modified_base
polyA_signal
polyA_site
precursor_RNA
prim_transcript
promoter
protein_bind
rRNA
repeat_region
satellite
scRNA
sig_peptide
snRNA
stem_loop
tRNA
terminator
transit_peptide
transposon
unsure
variation
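MySQL stores a SET value as a bit field with one bit per permitted member and allows at most 64 members per SET column, which is presumably where the cap on features comes from. The sketch below models that encoding in Python. The function names and the shortened `FEATURES` list are hypothetical, and this is an illustration of the idea rather than code from MyGenBank.

```python
# Sketch of how a SET column can be modelled: one bit per permitted member.
FEATURES = ["3_UTR", "3_clip", "5_UTR", "5_clip", "CDS", "exon", "gene", "intron"]

def encode(members, universe=FEATURES):
    """Pack a collection of member names into an integer bit mask."""
    mask = 0
    for name in members:
        mask |= 1 << universe.index(name)  # ValueError if name is not permitted
    return mask

def decode(mask, universe=FEATURES):
    """Unpack a bit mask into member names, in definition order."""
    return [name for i, name in enumerate(universe) if mask & (1 << i)]

def has_member(mask, name, universe=FEATURES):
    """Membership test, similar in spirit to find_in_set(name, column) > 0."""
    return bool(mask & (1 << universe.index(name)))

mask = encode(["exon", "CDS"])
print(mask, decode(mask), has_member(mask, "gene"))
```

Because every member needs its own bit, adding a feature to the definition file uses up one of the available slots.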
Setting up MyGenBank
Environment Variables
You need to set two environment variables. You should probably add these to
your login scripts.
- MYGENBANK_CODE
- This should point to the directory where this
documentation exists. You should find 4 subdirectories here: arch, bin, lib,
and def.
- MYGENBANK_DATA
- This should point to a directory where MyGenBank will
store its files. Four directories will be created here: Definition, GenBank,
Fasta, and Table.
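Before running anything, it can help to confirm that both variables are set and that the code directory looks as described. The following is a minimal sketch of such a check; `check_environment` is a hypothetical helper, not part of MyGenBank, and it only verifies what the text above states (the four subdirectories of the code directory).

```python
import os

CODE_SUBDIRS = ["arch", "bin", "lib", "def"]

def check_environment(environ=os.environ):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for var in ("MYGENBANK_CODE", "MYGENBANK_DATA"):
        if not environ.get(var):
            problems.append("%s is not set" % var)
    code = environ.get("MYGENBANK_CODE")
    if code:
        if not os.path.isdir(code):
            problems.append("MYGENBANK_CODE does not point to a directory: %s" % code)
        else:
            # The code directory should hold arch, bin, lib, and def.
            for sub in CODE_SUBDIRS:
                if not os.path.isdir(os.path.join(code, sub)):
                    problems.append("missing subdirectory: %s" % os.path.join(code, sub))
    return problems

for line in check_environment() or ["environment looks fine"]:
    print(line)
```

The data directory is only checked for being set, since the setup command creates its subdirectories.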
$MYGENBANK_CODE Directory
- arch
- An archive of code that is not necessary to run the current
version of MyGenBank.
- bin
- Contains the executable files for MyGenBank. Currently, this
contains mygb_admin, mygb_fetch, and mygb_query.
- def
- Contains the default keywords, features, and mol_types, which are
stored as SET types (ENUM for mol_type) in MyGenBank. These files may be edited to capture more
sequence attributes. Keywords are parsed from the GenBank KEYWORDS lines and
features are parsed from the "feature keys" in the feature table. See the Feature Table definition for
more information.
- lib
- Should contain the GBlite.pm Perl module that is used for parsing
GenBank flat files. This module is available from the lib_ikorf collection.
$MYGENBANK_DATA Directory
- Definition
- Contains files for keywords, features, divisions,
mol_types, and filenames. The keywords and features files are copied from
$MYGENBANK_CODE/def. The divisions, mol_types, and filenames files are created by
the "mygb_admin parse" command. Also contains the *.sql files. The
MyGenBank.sql file contains the database definition. Other *.sql files
correspond to the individual GenBank files.
- Fasta
- Contains the Fasta files corresponding to the sequence(s)
from the GenBank flat file. The files are created by the "mygb_admin
parse" command.
- GenBank
- Contains GenBank flat files downloaded from the NCBI. The files
are created by the "mygb_admin ftp" command.
- Table
- Contains tab-delimited data for bulk loading into MySQL. The files
are created by the "mygb_admin parse" command.
External Dependencies
Before you begin, you must have MySQL and Perl installed. You will also need
the libnet modules (just Net::FTP actually) as well as the MySQL DBI for Perl.
You can find these components at mysql.com and
CPAN.
Space Requirements
You're going to need a lot of space. GenBank is continually growing. See
the release notes to find out how big the flat files are for the latest
release. You need to add about 1/3 more than this for the Fasta versions
of the files. If you plan on doing incremental updates, you need to take
this into account too (see growth of GenBank in the release notes). And if
you want to make BLAST-able databases, you need space for that too. I am
currently using a single 73 GB drive for all the files, but this will be
insufficient by next year.
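The rule of thumb above can be written as a small calculation. This is a rough sketch: `estimate_space_gb` is a hypothetical helper, and the numbers in the usage line are placeholders rather than real GenBank sizes, which you should take from the release notes.

```python
def estimate_space_gb(flat_gb, update_gb=0.0, blast_gb=0.0):
    """Rough disk budget, in the same unit as the inputs."""
    fasta_gb = flat_gb / 3.0  # "about 1/3 more" for the Fasta copies
    return flat_gb + fasta_gb + update_gb + blast_gb

# Placeholder figures: flat files, expected updates, BLAST databases.
print(estimate_space_gb(30.0, update_gb=3.0, blast_gb=6.0))
```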
mygb_admin
The mygb_admin tool is used to build MyGenBank. The first time you try building
MyGenBank, you may wish to use the -t switch to enter testing mode. This will
process just one GenBank file, which will allow you to determine if your
environment is set up correctly before wasting a lot of download and CPU time.
mygb_admin setup
mygb_admin -t ftp
mygb_admin -t parse
mygb_admin -t build
mygb_admin -t test
If you plan on doing incremental updates, you should test this too.
mygb_admin -t update
mygb_admin -t test
mygb_admin commands
- setup
- creates the directories in $MYGENBANK_DATA if
necessary
- copies definitions from $MYGENBANK_CODE/def to
$MYGENBANK_DATA/Definition
- creates filenames and divisions files in
$MYGENBANK_DATA/Definition
- ftp
- reads the filenames from
$MYGENBANK_DATA/Definition/filenames
- skips files already transferred (checks
for existence in $MYGENBANK_DATA/GenBank)
- downloads each file, pipes it to
gunzip, and saves it to $MYGENBANK_DATA/GenBank
- parse
- reads the filenames from
$MYGENBANK_DATA/GenBank
- skips files already parsed (checks for existence in
$MYGENBANK_DATA/Table)
- parses each GenBank file
- creates a corresponding
fasta file in $MYGENBANK_DATA/Fasta
- creates a corresponding tab-delimited
file in $MYGENBANK_DATA/Table
- build
- reads filenames from $MYGENBANK_DATA/Table
and skips files already loaded into MySQL (checks for existence in
$MYGENBANK_DATA/Definition)
- defines the MyGenBank table (see the
$MYGENBANK_DATA/Definition/MyGenBank.sql file)
- loads each tab-delimited
file in $MYGENBANK_DATA/Table
- test
- runs some simple queries on MyGenBank, see the
section on querying below
- update
- gets a list of all GenBank update files from
NCBI
- ftp (unless already downloaded)
- parse (unless already
parsed)
- build (unless already built)
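Several of the commands above share one pattern: a file is skipped when its output already exists in the target directory, which is what makes an interrupted run restartable. The sketch below shows that pattern in isolation. The helper names `pending` and `run_step` and the file names are hypothetical, and the real mygb_admin is not implemented this way as far as this document states.

```python
import os

def pending(names, done_dir, suffix=""):
    """Return the names whose output file does not yet exist in done_dir."""
    return [n for n in names
            if not os.path.exists(os.path.join(done_dir, n + suffix))]

def run_step(names, done_dir, work, suffix=""):
    """Apply work(name) to every name that has no output yet."""
    todo = pending(names, done_dir, suffix)
    for name in todo:
        work(name)  # e.g. download, parse, or load one file
    return todo
```

Calling `run_step` a second time with the same arguments does nothing once every output file is present.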
You may put the commands together on a single line, and the typical command
line for a test build of MyGenBank would look like this:
mygb_admin -t setup ftp parse build test
If everything works, then you should use the following command line:
mygb_admin setup ftp parse build test >& logfile &
This may take some time. The exact amount will depend on your network,
CPU, filesystem, and the size of GenBank. On my workstation (900 MHz, 512 MB,
73 GB 10K SCSI, 400-600 Kb/sec bandwidth) with release 120, I was able to
build MyGenBank in about 6-10 hours, depending on traffic and whether I was
also including the updates.
If the build stops for some reason, like network failure, you can restart it
again and it won't download files previously fetched (see the command details
above). You may have to delete the last file created if it has errors. If
you're logging STDERR as shown above, you should be able to find the file
without any problems.
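If no log is available, the most recently modified file in a data directory is a reasonable candidate for the one that was being written when the build stopped. A minimal sketch, assuming modification times are trustworthy on your filesystem; `newest_file` is a hypothetical helper:

```python
import os

def newest_file(directory):
    """Return the path of the most recently modified regular file, or None."""
    paths = [os.path.join(directory, n) for n in os.listdir(directory)]
    files = [p for p in paths if os.path.isfile(p)]
    return max(files, key=os.path.getmtime) if files else None
```

Inspect the file before deleting it; this only suggests where to look.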
If you want to do incremental updates, you can use the following command:
mygb_admin update >& update_log &
Querying MyGenBank
There are two command line tools for interacting with MyGenBank. These are
explained below.
mygb_fetch
mygb_fetch is used for retrieving sequences in raw, Fasta, or GenBank format,
singly or in batches. You may specify accession numbers, gi numbers, or query
strings. The default format is Fasta. For example, to fetch a single specific
sequence, gi=23456, in Fasta format, you would type:
mygb_fetch 23456
You could also retrieve that entry by its accession:
mygb_fetch Z16870
Or with an abbreviated SQL statement (without the "select ... from ..."
prefix, and don't forget the inner quotes for strings):
mygb_fetch "gi = 23456"
mygb_fetch "accession = 'Z16870'"
You can also retrieve multiple sequences by including multiple arguments on the
command line:
mygb_fetch 23456 45678
You can retrieve a batch of sequences with abbreviated SQL syntax. Here's how
you build a Fasta database of all the human transcripts:
mygb_fetch "taxid = 9606 and mol_type = 'mRNA'" > human_tx.fasta
You can even mix and match if you like:
mygb_fetch Z16870 45678 "mol_type = 'uRNA'"
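One way to think about mixed arguments is that each one is turned into an SQL condition. The sketch below shows that idea under stated assumptions: all-digit arguments are treated as gi numbers, letters followed by digits as accessions, and anything else as an abbreviated SQL condition. The function `where_clause` and these matching rules are hypothetical; the actual mygb_fetch may classify or process its arguments differently.

```python
import re

def where_clause(args):
    """Combine mixed mygb_fetch-style arguments into one SQL condition."""
    parts = []
    for arg in args:
        if re.fullmatch(r"\d+", arg):             # all digits: gi number
            parts.append("gi = %s" % arg)
        elif re.fullmatch(r"[A-Za-z]+\d+", arg):  # letters then digits: accession
            parts.append("accession = '%s'" % arg)
        else:                                     # anything else: SQL condition
            parts.append("(%s)" % arg)
    return " or ".join(parts)

print(where_clause(["Z16870", "45678", "mol_type = 'uRNA'"]))
```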
Here's how to build a Fasta database of all human sequences with annotated coding
sequences that have been deposited in GenBank since March 15th, 2000. Note the
use of "find_in_set", which is used for querying features and keywords.
mygb_fetch "taxid = 9606 and find_in_set('CDS',features) and date > '2000-03-15'"
You can get the sequence in raw format or GenBank flat file format using the -r
and -g switches (and be explicit about fasta if you like):
mygb_fetch -r 23456
mygb_fetch -g 23456
mygb_fetch -f 23456
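The difference between Fasta and raw output is essentially the header line and the line breaks. The sketch below converts a single Fasta record to a raw sequence string. It illustrates the two formats only: `fasta_to_raw` is a hypothetical helper, the header in the usage line is invented, and the exact layout of mygb_fetch -r output is not specified here.

```python
def fasta_to_raw(text):
    """Drop '>' header lines and join the remaining sequence lines."""
    lines = text.splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))

# Invented record for illustration.
print(fasta_to_raw(">example header\nACGT\nTTGA\n"))
```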
You may find that the data in MyGenBank is limiting. For example, you might
want to know who the authors of the sequences are. To do this, you can process
the flat files as a post-processing step with UNIX shell tools, with the
GBlite.pm Perl module included in the $MYGENBANK_CODE/lib directory, or with
other tools, such as those found at bioperl.
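As an example of such post-processing, the sketch below collects the AUTHORS lines of each record from GenBank flat file text, such as the output of mygb_fetch -g. It relies on the flat file layout in which the keyword occupies the first 12 columns and continuation lines leave those columns blank. The function `authors_by_accession` is hypothetical and is not a substitute for GBlite.pm; the sample record is invented.

```python
def authors_by_accession(lines):
    """Map each record's primary accession to its list of AUTHORS strings."""
    result, accession, current = {}, None, None
    for line in lines:
        key, value = line[:12].strip(), line[12:].strip()
        if key == "ACCESSION" and value:
            accession = value.split()[0]  # first accession on the line
            result[accession] = []
            current = None
        elif key == "AUTHORS" and accession:
            result[accession].append(value)
            current = "AUTHORS"
        elif key == "" and value and current == "AUTHORS" and accession:
            result[accession][-1] += " " + value  # continuation line
        else:
            current = None
        if line.startswith("//"):  # end of record
            accession = None
    return result

sample = [
    "LOCUS       X00001",
    "ACCESSION   X00001",
    "REFERENCE   1",
    "  AUTHORS   Alpha,A., Beta,B. and",
    "            Gamma,C.",
    "  TITLE     An invented example",
    "//",
]
print(authors_by_accession(sample))
```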
For archival/publication reasons, you may want to exclude update sequences so
you can just say "we used Release 120". You can either build without updates or
use the -u switch in mygb_fetch.
mygb_fetch -u "division = 'EST'"
mygb_query
mygb_query is used for retrieving tab-delimited columns of data from the
database. To use it, you give it straight SQL. These are the commands issued by
"mygb_admin test":
"select COUNT(*) from MyGenBank",
"select accession, mol_type, date, taxid from MyGenBank limit 5",
"select length, accession from MyGenBank where length < 10 limit 5",
"select COUNT(*) from MyGenBank where division='BCT'",
"select COUNT(*) from MyGenBank where find_in_set('HTG', keywords)",
"select COUNT(*) from MyGenBank where find_in_set('CDS', features)",
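To experiment with queries like these without a MySQL server, the sketch below runs them against an in-memory SQLite table and prints tab-delimited rows. Everything here is a stand-in: SQLite replaces MySQL, `find_in_set` is emulated with a Python function, the table has only a few of the real columns, and the three rows are invented. It shows the shape of the output, not how mygb_query is implemented.

```python
import sqlite3

def find_in_set(needle, haystack):
    """Emulate MySQL find_in_set: 1-based position in a comma list, else 0."""
    items = haystack.split(",") if haystack else []
    return items.index(needle) + 1 if needle in items else 0

def query_tsv(conn, sql):
    """Run one statement and return the rows as tab-delimited lines."""
    return ["\t".join(str(v) for v in row) for row in conn.execute(sql)]

conn = sqlite3.connect(":memory:")
conn.create_function("find_in_set", 2, find_in_set)
conn.execute("create table MyGenBank (accession text, length integer, "
             "division text, keywords text, features text)")
conn.executemany("insert into MyGenBank values (?, ?, ?, ?, ?)", [
    ("X00001", 8, "BCT", "", "CDS,gene"),          # invented rows
    ("X00002", 1200, "EST", "EST", ""),
    ("X00003", 90000, "HTG", "HTG,HTGS_DRAFT", "CDS"),
])

for sql in ("select COUNT(*) from MyGenBank",
            "select length, accession from MyGenBank where length < 10 limit 5",
            "select COUNT(*) from MyGenBank where find_in_set('CDS', features)"):
    print("\n".join(query_tsv(conn, sql)))
```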