A quick way to parse a FASTA file is to use the FastaReaderHelper class.
Here is an example that fetches a UniProt FASTA file and parses it into a protein sequence.
public static ProteinSequence getSequenceForId(String uniProtId) throws Exception {
    URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId));
    // Close the network stream once the FASTA content has been read
    try (InputStream in = uniprotFasta.openStream()) {
        // Look up the parsed sequence by its UniProt ID
        ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(in).get(uniProtId);
        System.out.printf("id : %s %s%n%s%n", uniProtId, seq, seq.getOriginalHeader());
        return seq;
    }
}
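The example retrieves its result with `.get(uniProtId)`, which suggests the returned map is keyed by the accession found in each header line. As a rough illustration of that idea only, the plain-Java sketch below pulls the accession out of a UniProt-style header. It is not BioJava's parser: the class `HeaderSketch` and the method `accessionOf` are hypothetical names, and the fallback rule for other header styles is an assumption.

```java
// Hypothetical illustration; BioJava's own header parsers may behave differently.
class HeaderSketch {

    /**
     * Returns the accession from a UniProt-style header ("sp|ACCESSION|NAME ..."),
     * or the first whitespace-delimited word for any other header.
     */
    static String accessionOf(String headerLine) {
        // Drop the leading '>' that marks a FASTA header line
        String h = headerLine.startsWith(">") ? headerLine.substring(1) : headerLine;
        String[] fields = h.split("\\|");
        if (fields.length >= 3 && (fields[0].equals("sp") || fields[0].equals("tr"))) {
            return fields[1];
        }
        // Assumed fallback: use the text up to the first whitespace
        return h.trim().split("\\s+")[0];
    }

    public static void main(String[] args) {
        // prints P69905
        System.out.println(accessionOf(">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens"));
    }
}
```

If a lookup by ID returns `null`, comparing the requested ID with the accession in the file's header line is a reasonable first check.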
BioJava can also parse large FASTA files. The example below can parse a 1 GB (compressed) version of TrEMBL with default JVM memory settings, because it reads the file in small batches instead of loading every sequence at once.
/**
 * Download a large file, e.g. ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz
 * and pass in the path to the local copy of the file.
 *
 * @param args args[0] is the path to the FASTA file
 */
public static void main(String[] args) {
    if (args.length < 1) {
        System.err.println("First argument needs to be path to fasta file");
        return;
    }
    File f = new File(args[0]);
    if (!f.exists()) {
        System.err.println("File does not exist " + args[0]);
        return;
    }
    // InputStreamProvider automatically uncompresses the file;
    // try-with-resources closes the stream when parsing ends or fails
    try (InputStream inStream = new InputStreamProvider().getInputStream(f)) {
        FastaReader<ProteinSequence, AminoAcidCompound> fastaReader = new FastaReader<ProteinSequence, AminoAcidCompound>(
                inStream,
                new GenericFastaHeaderParser<ProteinSequence, AminoAcidCompound>(),
                new ProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet()));
        LinkedHashMap<String, ProteinSequence> b;
        int nrSeq = 0;
        // Read 10 sequences per call; the loop ends when process() returns null
        while ((b = fastaReader.process(10)) != null) {
            for (String key : b.keySet()) {
                nrSeq++;
                System.out.println(nrSeq + " : " + key + " " + b.get(key));
            }
        }
    } catch (Exception ex) {
        Logger.getLogger(ParseFastaFileDemo.class.getName()).log(Level.SEVERE, null, ex);
    }
}
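The loop above asks for ten sequences per call to `process(10)` and stops when it gets `null`, so only one small batch is held in memory at a time. To make that pattern concrete, here is a minimal plain-Java sketch of a batched reader over a gzip-compressed stream. It is not BioJava's implementation: `BatchedFastaSketch` is a hypothetical class, `GZIPInputStream` stands in for the decompression that `InputStreamProvider` performs, and sequences are kept as plain strings keyed by their full header text.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for a batched FASTA reader; not BioJava code.
class BatchedFastaSketch {
    private final BufferedReader reader;
    // Header that has been read but whose sequence lines are not yet consumed
    private String pendingHeader;

    BatchedFastaSketch(InputStream in) {
        this.reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    /** Reads up to max records (assumes unique headers); returns null once the input is exhausted. */
    LinkedHashMap<String, String> process(int max) {
        if (max < 1) {
            throw new IllegalArgumentException("max must be at least 1");
        }
        LinkedHashMap<String, String> batch = new LinkedHashMap<>();
        StringBuilder seq = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (pendingHeader != null) {
                        // A new header closes the previous record
                        batch.put(pendingHeader, seq.toString());
                        seq.setLength(0);
                    }
                    pendingHeader = line.substring(1).trim();
                    if (batch.size() == max) {
                        return batch; // the pending header is picked up by the next call
                    }
                } else if (pendingHeader != null) {
                    seq.append(line.trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (pendingHeader != null) {
            // End of input: flush the last record
            batch.put(pendingHeader, seq.toString());
            pendingHeader = null;
        }
        return batch.isEmpty() ? null : batch;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny gzip-compressed FASTA in memory instead of downloading a file
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(">a\nMKV\nLAA\n>b\nGGH\n>c\nPPW\n".getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            BatchedFastaSketch fasta = new BatchedFastaSketch(in);
            LinkedHashMap<String, String> batch;
            while ((batch = fasta.process(2)) != null) {
                batch.forEach((header, residues) -> System.out.println(header + " " + residues));
            }
        }
    }
}
```

With a batch size of 2 and three records, the first call returns records `a` and `b`, the second returns `c`, and the third returns `null`, which ends the loop in the same way as the BioJava example.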
BioJava can also process large FASTA files using the Java streams API.
FastaStreamer
    .from(path)
    .stream()
    .forEach(sequence -> System.out.printf("%s -> %s%n", sequence.getOriginalHeader(), sequence.getSequenceAsString()));
If you need a header parser other than GenericFastaHeaderParser or a sequence creator other than ProteinSequenceCreator, specify them before streaming the contents, as follows:
FastaStreamer
    .from(path)
    .withHeaderParser(new PlainFastaHeaderParser<>())
    .withSequenceCreator(new CasePreservingProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet()))
    .stream()
    .forEach(sequence -> System.out.printf("%s -> %s%n", sequence.getOriginalHeader(), sequence.getSequenceAsString()));
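Since the examples above consume the sequences through the Java streams API, the usual intermediate operations such as `filter` and `map` should also apply before the terminal step. The sketch below shows that kind of pipeline in plain Java. The class `StreamOpsSketch` and its nested `Rec` type are hypothetical stand-ins for a parsed sequence (header plus residues), used here so the example runs without BioJava.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration of stream operations on parsed records; not BioJava code.
class StreamOpsSketch {

    /** Stand-in for a parsed sequence: header text plus residue string. */
    static final class Rec {
        final String header;
        final String residues;

        Rec(String header, String residues) {
            this.header = header;
            this.residues = residues;
        }
    }

    /** Returns the headers of all records with at least minLength residues, in input order. */
    static List<String> headersOfLongRecords(Stream<Rec> records, int minLength) {
        return records
                .filter(r -> r.residues.length() >= minLength)
                .map(r -> r.header)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<Rec> records = Stream.of(
                new Rec("a", "MKVLAA"),
                new Rec("b", "GGH"),
                new Rec("c", "PPWLLK"));
        // prints [a, c]
        System.out.println(headersOfLongRecords(records, 5));
    }
}
```

With real data, the same `filter` and `map` calls would sit between `.stream()` and the terminal operation, using `getSequenceAsString()` and `getOriginalHeader()` in place of the stand-in fields.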