Set tab delimiter in manpage for tabix GFF3 sort #1457

Merged 3 commits on Jul 4, 2022


cmdcolin
Contributor

@cmdcolin cmdcolin commented Jun 17, 2022

This helps when the GFF3 file contains spaces, e.g. in column 2 or 3. My co-worker found this in the wild in a GFF file here: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/285/GCF_000002285.3_CanFam3.1/GCF_000002285.3_CanFam3.1_genomic.gff.gz

A similar alternative command is

awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -t\"\t\" -k1,1 -k4,4n"}' file.gff > file.sorted.gff

But this PR keeps the existing command as is.

@daviesrob daviesrob self-assigned this Jun 21, 2022
@daviesrob
Member

This is a good idea, but $'\t' to get the tab is a bash-ism. "`printf '\t'`" is more portable, and even works in csh.

I quite like the awk version as it only makes one pass through the file, although sadly it seems to be slightly slower than making two passes with grep. Presumably that's because awk is doing more work splitting the input up.
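The portable tab delimiter suggested above can be tried on a tiny input. This is only a minimal sketch: the function name `sort_gff_body` and the two records are hypothetical, made up for illustration. Without `-t`, sort splits fields on blanks, so a space in column 2 shifts the column that `-k4,4n` refers to.

```shell
# Sort GFF3 records read from stdin by sequence name, then numeric start.
# The backtick form yields a literal tab without relying on bash's $'\t'.
# sort_gff_body is a hypothetical helper name, not part of tabix or htslib.
sort_gff_body() {
  sort -t"`printf '\t'`" -k1,1 -k4,4n
}

# Hypothetical records; column 2 ("my source") contains a space.
printf 'chr1\tmy source\tgene\t200\t300\nchr1\tmy source\tgene\t50\t90\n' |
  sort_gff_body
```

With the tab set as the delimiter, the record starting at 50 is listed before the record starting at 200.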

@cmdcolin
Contributor Author

cmdcolin commented Jul 2, 2022

As long as we are looking at compatibility, the current one actually has an issue in zsh (I think the ^ needed to be inside the quotes), so I updated it again :)

Also incorporated the printf '\t'! Good catch.

@cmdcolin
Contributor Author

cmdcolin commented Jul 2, 2022

Also interesting that the awk version is a bit slower. It could also be helpful to make the command easier to type. For that, we could potentially offer people a bash function like:

function gffsort() {
  # Quote "$1" so file names containing spaces are passed through intact.
  grep "^#" "$1";
  grep -v "^#" "$1" | sort -t"`printf '\t'`" -k1,1 -k4,4n;
}

then

gffsort input.gff | bgzip > out.gff.gz
tabix -p gff out.gff.gz

@daviesrob
Member

The update is mostly OK, except that the \t got interpreted as a troff macro and disappeared from the formatted output. I'll push a tiny fix.

The slowness with awk is most likely due to it splitting the entire line into fields. grep can get away with just looking at the first character, which is much faster.

@daviesrob daviesrob merged commit ca34d9e into samtools:develop Jul 4, 2022
@cmdcolin cmdcolin deleted the patch-1 branch July 4, 2022 17:53
2 participants