Tuesday, May 30, 2006

Download Gene Sequences Using NCBI eFetch Tools

Recently, I was working on a bioinformatics research project that required downloading hundreds of gene mRNA sequences. I had all the gene IDs in one text file, so a simple PowerShell script could solve my problem.

I have an older post about the NCBI Entrez eUtils tools. Today I will use the eFetch tool (also part of eUtils). The script is simple and self-explanatory.
# ===========================================================================
#
# Author:      Tony (http://MSHForFun.blogspot.com)
# File:        Efetch.ps1
# Description: Download gene sequences using NCBI eUtils.eFetch tool
# Reference: http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_example.pl
# Reference: http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html
# Reference: http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html

# ===========================================================================
param
(
  [string] $Path=$(throw "Please specify a file")
)
$BaseURL = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id="
$Option= "&rettype=fasta&retmode=text"
$WebClient = new-object System.Net.WebClient
$SavePath = $Path + ".result"
if (test-path $savePath)
{
  del $SavePath
}
foreach ( $id in (get-content $path))
{
  # Construct eFetch URL
  $URL=$BaseURL + $id + $Option
  Write-Progress -Activity "Download Sequences" -Status "Submit gene $Id"
  # Submit and download data
  $Data = $WebClient.DownloadString($URL)
  # Parse Data
  if ($Data.Length -gt 1)
  {
    Write-Progress -Activity "Download Sequences" -Status "$id OK"
    # Write to Console
    $data
    # Write to file
    $data >> $SavePath
  }
  else
  {
    Write-Progress -Activity "Download Sequences" -Status "$Id is not found!"
    # Report the missing ID on the console and in the result file (CR LF line ending)
    "$Id is not found!`r`n"
    "$Id is not found!`r`n" >> $SavePath
  }
  # Try not to overload NCBI Server
  start-sleep 1
}
# Clear Progress pane
Write-Progress -Activity "Download Sequences" -Status "Done" -completed
You need a text file (genes.txt) to test this script:
0
NM_008176
NM_009140
NM_009141
NM_011333
NM_013654
NM_016960
NM_009142
NM_008491
NM_031168
NM_009883
NM_007679
NM_010030
NM_009971
NM_010809
NM_008607
NM_030612
NM_011198
NM_007987
If you are a biologist, you can see what kind of genes I am interested in. The first "0" is there just to trigger a "Not Found" error. You can run the script like this:
.\efetch.ps1 genes.txt
The results are printed to the screen and also saved to the "genes.txt.result" file.
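
A note on the "0" entry: depending on how the server responds, an unknown ID might come back as an empty body or as an HTTP error, and in the second case DownloadString would throw instead of returning a short string. Below is a minimal sketch of a more defensive download step. It assumes PowerShell 2.0 or later (for try/catch), and Get-EFetchUrl and Get-EFetchData are hypothetical helper names, not part of the script above.

```powershell
# Hypothetical helper (not in the original script): builds the eFetch URL for one ID.
function Get-EFetchUrl
{
  param
  (
    [string] $Id,
    [string] $BaseURL = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=",
    [string] $Option = "&rettype=fasta&retmode=text"
  )
  # Trim and escape the ID so stray whitespace or symbols in the input file
  # cannot break the query string
  return $BaseURL + [System.Uri]::EscapeDataString($Id.Trim()) + $Option
}

# Hypothetical helper: returns the downloaded text, or $null when the request fails.
function Get-EFetchData
{
  param
  (
    [System.Net.WebClient] $WebClient,
    [string] $URL
  )
  try
  {
    return $WebClient.DownloadString($URL)
  }
  catch
  {
    # Assumption: a bad ID may surface as an HTTP error rather than an empty body.
    # Either way the caller can treat $null as "not found".
    return $null
  }
}
```

Inside the loop, the download line could then become something like $Data = Get-EFetchData $WebClient (Get-EFetchUrl $id), with the existing check extended to treat $null as not found.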

Have Fun
