Reading and Writing of Basic sequence file formats

TODO: needs more examples

FASTA

A quick way to parse a FASTA file is to use the FastaReaderHelper class.

Here is an example that parses a UniProt FASTA file into a protein sequence.

    public static ProteinSequence getSequenceForId(String uniProtId) throws Exception {
        URL uniprotFasta = new URL(String.format("https://www.uniprot.org/uniprot/%s.fasta", uniProtId));
        // try-with-resources closes the network stream once the sequence has been read
        try (InputStream stream = uniprotFasta.openStream()) {
            // the helper returns a map of sequences; look up the entry for the requested ID
            ProteinSequence seq = FastaReaderHelper.readFastaProteinSequence(stream).get(uniProtId);
            System.out.printf("id : %s %s%n%s%n", uniProtId, seq, seq.getOriginalHeader());
            return seq;
        }
    }

BioJava can also parse large FASTA files. The example below parses a 1 GB (compressed) version of TrEMBL with standard memory settings, because it reads the file in batches of 10 sequences instead of loading it all at once.

    
    
    
        /**
         * Download a large file, e.g. ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz
         * and pass in the path to the local copy of the file.
         *
         * @param args the first argument is the path to the FASTA file
         */
        public static void main(String[] args) {

            if ( args.length < 1) {
                System.err.println("First argument needs to be path to fasta file");
                return;
            }

            File f = new File(args[0]);

            if ( ! f.exists()) {
                System.err.println("File does not exist " + args[0]);
                return;
            }

            // InputStreamProvider automatically uncompresses the file;
            // try-with-resources closes the stream when parsing finishes or fails
            try (InputStream inStream = new InputStreamProvider().getInputStream(f)) {

                FastaReader<ProteinSequence, AminoAcidCompound> fastaReader = new FastaReader<>(
                        inStream,
                        new GenericFastaHeaderParser<>(),
                        new ProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet()));

                LinkedHashMap<String, ProteinSequence> b;
                int nrSeq = 0;

                // read 10 sequences per call; process() returns null once the file is exhausted
                while ((b = fastaReader.process(10)) != null) {
                    for (String key : b.keySet()) {
                        nrSeq++;
                        System.out.println(nrSeq + " : " + key + " " + b.get(key));
                    }
                }
            } catch (Exception ex) {
                Logger.getLogger(ParseFastaFileDemo.class.getName()).log(Level.SEVERE, null, ex);
            }
        }
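To make the batching idea concrete, here is a minimal sketch in plain Java that reads FASTA records a few at a time. It does not use BioJava and is not how FastaReader is implemented; the class BatchedFastaSketch and its nextBatch method are hypothetical names that only mirror the shape of the process(10) loop above, returning up to a given number of records per call and null at the end of the input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;

// Hypothetical illustration only: a tiny batched FASTA reader in plain Java.
// It is NOT BioJava's FastaReader; it only mimics the shape of process(n).
class BatchedFastaSketch {
    private final BufferedReader in;
    private String pendingHeader; // header already read but whose record is not yet returned

    BatchedFastaSketch(Reader reader) {
        this.in = new BufferedReader(reader);
    }

    // Returns up to max records (header -> sequence), or null at end of input.
    LinkedHashMap<String, String> nextBatch(int max) {
        LinkedHashMap<String, String> batch = new LinkedHashMap<>();
        try {
            String line;
            // skip anything that precedes the first header line
            while (pendingHeader == null && (line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    pendingHeader = line.substring(1).trim();
                }
            }
            while (pendingHeader != null && batch.size() < max) {
                String header = pendingHeader;
                pendingHeader = null;
                StringBuilder seq = new StringBuilder();
                // collect sequence lines until the next header or end of input
                while ((line = in.readLine()) != null) {
                    if (line.startsWith(">")) {
                        pendingHeader = line.substring(1).trim();
                        break;
                    }
                    seq.append(line.trim());
                }
                batch.put(header, seq.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return batch.isEmpty() ? null : batch;
    }

    public static void main(String[] args) {
        String fasta = ">a\nMKV\nLLA\n>b\nGGS\n>c\nPPT\n";
        BatchedFastaSketch reader = new BatchedFastaSketch(new StringReader(fasta));
        LinkedHashMap<String, String> batch;
        // same loop shape as the BioJava example: stop when null is returned
        while ((batch = reader.nextBatch(2)) != null) {
            System.out.println(batch);
        }
    }
}
```

Only one batch is held in memory at a time, which is why this style of loop can cope with files far larger than the heap.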

BioJava can also process large FASTA files using the Java streams API.

    FastaStreamer
        .from(path)
        .stream()
        .forEach(sequence -> System.out.printf("%s -> %s%n", sequence.getOriginalHeader(), sequence.getSequenceAsString()));
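For readers who want to see what streaming a FASTA file involves, the following plain-Java sketch wraps a record iterator in a lazy java.util.stream.Stream. It is an illustration under simplifying assumptions, not FastaStreamer's implementation: the class FastaStreamSketch and its stream method are hypothetical, and records are represented as simple (header, sequence) pairs rather than BioJava sequence objects.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.AbstractMap.SimpleEntry;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical illustration only: streams FASTA records using the JDK alone.
class FastaStreamSketch {

    private static final class RecordIterator implements Iterator<Map.Entry<String, String>> {
        private final BufferedReader in;
        private String nextHeader; // header of the record next() will return, or null at end

        RecordIterator(BufferedReader in) {
            this.in = in;
            // advance to the first header line, ignoring anything before it
            String line;
            while ((line = readLine()) != null) {
                if (line.startsWith(">")) {
                    nextHeader = line.substring(1).trim();
                    break;
                }
            }
        }

        private String readLine() {
            try {
                return in.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        @Override
        public boolean hasNext() {
            return nextHeader != null;
        }

        @Override
        public Map.Entry<String, String> next() {
            if (nextHeader == null) {
                throw new NoSuchElementException();
            }
            String header = nextHeader;
            nextHeader = null;
            StringBuilder seq = new StringBuilder();
            String line;
            // read sequence lines until the next header or end of input
            while ((line = readLine()) != null) {
                if (line.startsWith(">")) {
                    nextHeader = line.substring(1).trim();
                    break;
                }
                seq.append(line.trim());
            }
            return new SimpleEntry<>(header, seq.toString());
        }
    }

    // Each record is parsed only when the stream asks for it.
    static Stream<Map.Entry<String, String>> stream(Reader reader) {
        Iterator<Map.Entry<String, String>> it = new RecordIterator(new BufferedReader(reader));
        Spliterator<Map.Entry<String, String>> split =
                Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED | Spliterator.NONNULL);
        return StreamSupport.stream(split, false);
    }

    public static void main(String[] args) {
        stream(new StringReader(">a\nMKV\nLLA\n>b\nGGS\n"))
                .forEach(e -> System.out.printf("%s -> %s%n", e.getKey(), e.getValue()));
    }
}
```

Because the stream is sequential and backed by an iterator, operations such as filter, map and limit can be chained without ever materialising the whole file.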

If you need a header parser other than GenericFastaHeaderParser or a sequence creator other than ProteinSequenceCreator, specify them before streaming the contents as follows:

    FastaStreamer
        .from(path)
        .withHeaderParser(new PlainFastaHeaderParser<>())
        .withSequenceCreator(new CasePreservingProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet()))
        .stream()
        .forEach(sequence -> System.out.printf("%s -> %s%n", sequence.getOriginalHeader(), sequence.getSequenceAsString()));
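A header parser is worth swapping because the same header line can be read in more than one way. The plain-Java sketch below contrasts two hypothetical strategies: one keeps the whole header as the identifier, the other extracts the accession from a pipe-delimited, UniProt-style header. The class HeaderParserSketch, both strategies and the example header are made up for illustration; they are stand-ins and do not reproduce the logic of PlainFastaHeaderParser or GenericFastaHeaderParser.

```java
import java.util.function.Function;

// Hypothetical illustration only: pluggable header-parsing strategies in plain Java.
class HeaderParserSketch {

    // Strategy 1: use the entire header text as the identifier.
    static final Function<String, String> WHOLE_HEADER = header -> header.trim();

    // Strategy 2: for "db|ACCESSION|name ..." headers take the second pipe-delimited
    // field; otherwise fall back to the first whitespace-delimited token.
    static final Function<String, String> PIPE_ACCESSION = header -> {
        String[] fields = header.trim().split("\\|");
        if (fields.length >= 3) {
            return fields[1];
        }
        return header.trim().split("\\s+")[0];
    };

    // Drops the leading '>' if present, then applies the strategy the caller supplies.
    static String idOf(String headerLine, Function<String, String> parser) {
        String header = headerLine.startsWith(">") ? headerLine.substring(1) : headerLine;
        return parser.apply(header);
    }

    public static void main(String[] args) {
        // made-up header in UniProt style, not a real entry
        String line = ">sp|X00000|DEMO_TEST Example protein";
        System.out.println(idOf(line, WHOLE_HEADER));
        System.out.println(idOf(line, PIPE_ACCESSION));
    }
}
```

Passing the strategy in as a parameter is the same design choice as withHeaderParser: the reading loop stays unchanged while the interpretation of the header varies.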

