update to work with vcf 4.4 prefixed phasing info #1861

Open, wants to merge 3 commits into base: develop

vasudeva8
Contributor

@vasudeva8 vasudeva8 commented Nov 26, 2024

Update to support the prefixed phasing indicators introduced in the VCF 4.4 specification.

VCF files with version < 4.4 are handled as before; the changes apply only to v4.4 and later.
When an explicit phasing indicator is present, it is used.
When there is no explicit phasing indicator, the phasing status is inferred as described in section 6.3.3 of the VCF 4.4 specification.
Haploids are considered implicitly phased; for other ploidies the 1st allele is considered phased unless another allele is unphased.

Output includes the prefixed phasing indicator only when the phasing cannot be correctly inferred without it.
E.g. an unphased haploid (not usually seen in practice), an unphased 1st allele with all other alleles phased, or a phased 1st allele with another unphased allele.
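
The inference and output rules described above can be sketched in plain C on GT strings. This is only an illustration under the assumption that the rule is exactly as stated in this description; the helper names (`gt_implied_first_phase`, `gt_first_allele_phased`, `gt_prefix_needed`) are hypothetical and are not HTSlib functions, and the PR itself works on BCF-encoded genotypes rather than text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of HTSlib.
 * Implicit rule: with no prefix, the 1st allele counts as phased unless
 * some separator in the genotype is '/'. A haploid call has no
 * separators, so it is implicitly phased. */
static bool gt_implied_first_phase(const char *gt_no_prefix)
{
    assert(gt_no_prefix != NULL);
    for (const char *p = gt_no_prefix; *p; p++)
        if (*p == '/')
            return false;
    return true;
}

/* Reader side: an explicit leading '|' or '/' wins; otherwise infer. */
static bool gt_first_allele_phased(const char *gt)
{
    assert(gt != NULL);
    if (gt[0] == '|') return true;
    if (gt[0] == '/') return false;
    return gt_implied_first_phase(gt);
}

/* Writer side: a prefix is needed only when the implicit rule would give
 * a different answer from the true phasing of the 1st allele. */
static bool gt_prefix_needed(bool first_phased, const char *gt_no_prefix)
{
    return gt_implied_first_phase(gt_no_prefix) != first_phased;
}
```

With these helpers, `gt_prefix_needed(false, "1")` (unphased haploid), `gt_prefix_needed(false, "0|1")` and `gt_prefix_needed(true, "0/1")` all report that a prefix is required, matching the three cases listed above, while the ordinary `0|1` and `0/1` genotypes need none.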

A new bcf_format_gt1 method, derived from bcf_format_gt, performs this output formatting. bcftools may have to use this method to handle VCF 4.4 files with prefixed phasing info.
A version enumeration type is added, along with a method that converts the file format header to the enumerated version.

Fixes #1847

htslib/vcf.h Outdated
* Explicit / prefixed phasing for 1st allele is used only when it is a must to
* correctly express phasing.
*/
static inline int bcf_format_gt1(const bcf_hdr_t *hdr, bcf_fmt_t *fmt, int isample, kstring_t *str)
Member

@pd3 pd3 Nov 27, 2024


The suffix 1, as used here, is confusing and somewhat conflicting with the convention used throughout the rest of the code. Can you rename so that the 'phase' keyword is included? Or, even better, create an extended function where various flags can be turned on, starting with PHASE_1ST_ALLELE, or something like that?

Contributor Author


I think the name should not be made more specific; if anything, it should be made more generic!

An implementation which supports the newer format should be the 1st choice going forward.
The best option would have been to update the current method itself, but that is not possible because the signature change would break compatibility.
Hence I chose a name that is in sync with the current method. We also have a few others along the same lines, like index_load/index_build etc.

Contributor Author


updated the name to bcf_format_gt_v2

@jmarshall
Member

Re bcf_get_version(), I think the “htslib way” would be to accept this notation regardless of input file VCF level — trusting the users only to use it where appropriate.

If it is decided to keep bcf_get_version(), I think it would be better to return the VCF level as an integer like 4005000 and avoid introducing an enum like bcf_version.

@vasudeva8
Contributor Author

Updated based on review discussions:
moved the version to bcf_hdr_aux_t to cache it and avoid repeated retrieval from the header
updated bcf_format_gt to use bcf_format_gt_v2
changed the enumeration to an integer variable
4.2 (4002000) is used as the default version
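
As an illustration of the integer representation discussed here, the sketch below converts a `fileformat` value such as `VCFv4.4` to an integer. The encoding `major * 1000000 + minor * 1000` is an assumption inferred from the values 4002000 and 4005000 quoted in this thread, and `vcf_version_to_int` and `VCF_VERSION_DEFAULT` are hypothetical names, not the code in this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed default when no usable version is found: VCF 4.2. */
#define VCF_VERSION_DEFAULT 4002000

/* Hypothetical helper, not part of HTSlib.
 * Converts "VCFv<major>.<minor>" to major*1000000 + minor*1000,
 * e.g. "VCFv4.4" becomes 4004000. Returns -1 if the text does not parse. */
static int vcf_version_to_int(const char *fileformat)
{
    int major = 0, minor = 0;
    assert(fileformat != NULL);
    if (sscanf(fileformat, "VCFv%d.%d", &major, &minor) != 2)
        return -1;
    /* Reject signs and minors that would overflow the 3-digit slot. */
    if (major < 0 || minor < 0 || minor > 999)
        return -1;
    return major * 1000000 + minor * 1000;
}
```

A caller could fall back to `VCF_VERSION_DEFAULT` when the function returns -1, and then gate the 4.4 behaviour with a plain comparison such as `version >= 4004000`.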

Successfully merging this pull request may close these issues.

HTSlib does not support VCF v4.4's explicit first phasing indicator