PowerShell Remoting Project Home

Wednesday, December 28, 2005

NCBI Blastn under MSH command line

From NCBI web site:
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

Blastn is one of my most frequently used tools from NCBI. In stead of going to their web site, I can now run blastn under command line. It is convenient and even faster than web interfaces. Although I use "nr" database here, you can change the URL string to use other database or other blast program.

As for scripting, nothing magic here. Just rewrite original perl script using .NET class: System.Net.WebClient . For Regular Expression, we can use "-match" expression.
Thanks for Lee Holmes Blog

# Begin of script
# =============================================================
# This code is for test purposes only. Use it at your own risk.
# Please do not submit or retrieve more than one request every
# two seconds. Results will be kept at NCBI for 24 hours. For
# best batch performance,they recommend that you submit requests
# after 2000 EST (0100 GMT) and retrieve results before 0500 EST
# (1000 GMT).
# reference: http://www.ncbi.nlm.nih.gov/blast/Doc/urlapi.html
# reference: http://www.ncbi.nlm.nih.gov/blast/docs/web_blast.pl
# =============================================================
# return codes:
# 0 – success
# 1 - invalid arguments
# 2 - no hits found
# 3 - rid expired
# 4 - search failed
# 5 - unknown error
# =========================================================
param([string] $query)

if (-not $query)
{
"Please specify query seqence!"
return 1
}

#Submit query sequence
"================================================================="
"Query sequence:"
""
$query
""
"Submit query sequence..."
$uri="http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Put&PROGRAM=blastn&DATABASE=nr&QUERY=" + $query
$BlastClient = new-object System.Net.WebClient
$pagecontent = $BlastClient.DownloadString($uri);
"================================================================="
# Get RID
$pagecontent -match " RID = (.*)"
$RID=$Matches[1]
"RID=" + $RID
" "
$pagecontent -match " RTOE = (.*)"
$TimeToComplete= $Matches[1]
"Time to complete search: " + $TimeToComplete + " seconds or sooner. Waiting..."
" "
Start-sleep $TimeToComplete

#Waiting for results
While ($true)
{
Start-sleep 5
$uri= "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Get&FORMAT_OBJECT=SearchInfo&RID=" + $RID

$pagecontent = $BlastClient.DownloadString($uri);

if ($pagecontent -match "Status=WAITING")
{
"Results not ready, waiting..."
"================================================================="
continue
}

if ($pagecontent -match "Status=FAILED")
{
"Search failed, exiting..."
return 2
}

if ($pagecontent -match "Status=UNKNOWN")
{
"Search expired, exiting..."
return 3
}

if ($pagecontent -match "Status=READY")
{
if ($pagecontent -match "ThereAreHits=yes")
{
"Search complete, retrieving results...";
"================================================================="
break
}
else
{
"No hits found.\n";
return 4
}
}
return 5
}

# Get results
$uri= "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Get&RID=" + $RID + "&ALIGNMENTS=500&ALIGNMENT_VIEW=QueryAnchored&FORMATOBJECT=Alignment&FORMAT_TYPE=TEXT"
$pagecontent = $BlastClient.DownloadString($uri);
$pagecontent >.\blast.results
Get-content .\blast.results | more
return 0
#End of script

So if we do this under MSH prompt:
. blastn-nr.msh AAAAAAAAGGGGGGCCCCCTTTT


the output would like:

==============================================================
Query sequence: AAAAAAAAGGGGGGCCCCCTTTT
Submit query sequence...
RID=1135712055-18130-135523741431.BLASTQ1
Time to complete search: 12 seconds or sooner. Waiting...
Results not ready, waiting...
Search complete, retrieving results...
=============================================================

BLASTN 2.2.12 [Aug-07-2005]Reference: Altschul, Stephen F., Thomas L. Madden,
Alejandro A. Schffer,Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman
(1997), "Gapped BLAST and PSI-BLAST: a new generation ofprotein database search
programs", Nucleic Acids Res. 25:3389-3402.

RID: 1135712055-18130-135523741431.BLASTQ1

Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS,GSS,environmental
samples or phase 0, 1 or 2 HTGS sequences) 3,659,359 sequences; 16,326,075,880
total letters

Query= (23 letters)

Score ESequences producing significant alignments: (Bits) Valued
AK202294.1 Mus musculus cDNA, clone:Y1G0140P22, strand:m... 40.1 0.067gb
AF467007.1AF467007S1 Homo sapiens histidine N-methyltrans... 38.2 0.27
dbjAK192209.1 Mus musculus cDNA, clone:Y1G0108D05, strand:m... 38.2 0.27
embAJ333598.1HSA333598 Homo sapiens genomic sequence surrou... 38.2 0.27
embAJ343149.1HSA343149 Homo sapiens genomic sequence surrou... 38.2 0.27
embAJ343845.1HSA343845 Homo sapiens genomic sequence surrou... 38.2 0.27
gbAY811874.1 Schistosoma japonicum SJCHGC01957 protein mRNA, p 36.2 1.0
dbjAB232923.1 Oryzias latipes hox gene cluster, complete cds, 36.2 1.0
dbjBS000120.2 Pan troglodytes chromosome 22 clone:RP43-007D... 36.2 1.0
......

>gbAY811874.1 Schistosoma japonicum SJCHGC01957 protein mRNA, partial cds
Length=718
Score = 36.2 bits (18),
Expect = 1.0
Identities = 18/18 (100%),
Gaps = 0/18 (0%)
Strand=Plus/Minus
Query 2 AAAAAAAGGGGGGCCCCC 19
Sbjct 156 AAAAAAAGGGGGGCCCCC 139
......

Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS,
GSS,environmentalsamples or phase 0, 1 or 2 HTGS sequences)
Posted date: Dec 26, 2005 4:16 AM
Number of letters in database: -853,793,300
Number of sequences in database: 3,659,359
Lambda K H 1.37 0.711 1.31
GappedLambda K H 1.37 0.711 1.31
Matrix: blastn matrix:1 -3
Gap Penalties:
Existence: 5,
Extension: 2
Number of Sequences: 3659359
Number of Hits to DB: 2403072
Number of extensions: 36138
Number of successful extensions: 36138
Number of sequences better than 10: 21
Number of HSP's better than 10 without gapping: 21
Number of HSP's gapped: 36138
Number of HSP's successfully gapped: 21
Number of extra gapped extensions for HSPs above 10: 36070
Length of query: 23
Length of database: 16326075880
Length adjustment: 18
Effective length of query: 5
Effective length of database: 16260207418
Effective search space: 81301037090
Effective search space used: 81301037090
A: 0X1: 11 (21.8 bits)
X2: 15 (29.7 bits)
X3: 25 (49.6 bits)
S1: 11 (22.3 bits)
S2: 17 (34.2 bits)
====================================================
Have fun!

[Edit: Monad has now been renamed to Windows PowerShell. This script or discussion may require slight adjustments before it applies directly to newer builds.]

Tags:       


Comments:
your post is helpful and informative
 
Really very good blog for wow gold its really excellent. The content of your blog Is really superb.
 

Post a Comment





<< Home