standage commented Apr 24, 2025

In this PR I'm updating the binder demo notebook. In the process, I changed the allele formatting from A|T|T|A to A:T:T:A to avoid confusion with conventional genetic notation for haplotype phases. (I would love to have dropped the separators altogether, but some legacy functions of the database still need to handle microhaps with indels correctly.)
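
For illustration, the separator change could be applied to existing allele strings with a small helper like the one below. This is only a sketch: `reformat_allele` is a hypothetical name, not a function in the database's API, and it assumes `|` never appears inside an individual allele.

```python
def reformat_allele(allele: str, old_sep: str = "|", new_sep: str = ":") -> str:
    """Swap the separator between the per-variant alleles of a microhap allele string."""
    # Splitting and re-joining (rather than dropping the separator) keeps
    # multi-base alleles, such as those at indel sites, unambiguous.
    return new_sep.join(allele.split(old_sep))


print(reformat_allele("A|T|T|A"))    # A:T:T:A
print(reformat_allele("A|TCA|T|A"))  # A:TCA:T:A
```

Keeping a separator at all is what lets the second call round-trip: without it, `ATCATA` could not be split back into four alleles.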

I also found a bug with how non-1KGP allele frequencies were being renamed post-resolution of locus and allele definition identifiers. It only affected four allele definitions at two loci, and was resolved with a simple change to the build procedure. None of the standard 1KGP allele frequencies or Ae scores were affected.

  • mh05KK-023 --> mh05KK-023.v1
  • mh05KK-020 --> mh05KK-023.v2
  • mh05KK-120 --> mh05KK-120.v1
  • mh05KK-121 --> mh05KK-120.v2

Update: after running the new regression test on the master branch, I found three more affected loci (see the comment below). As before, the 1KGP allele frequencies remain unaffected.


     def __init__(self, name, rsids, index, xrefs=None, source=None):
         self.name = Marker.check_name(name)
+        self.source_name = str(self.name)

Bug fix part 1

Comment on lines -49 to +56
-    self.source_name_map[marker.source.name][marker.name] = self.definition_names[marker.posstr()]
+    self.source_name_map[marker.source.name][marker.source_name] = self.definition_names[marker.posstr()]
     continue
 else:
     new_name = marker.name
     if len(self.markers_by_definition) > 1:
         new_name = f"{marker.name}.v{len(self.definition_names) + 1}"
     self.definition_names[marker.posstr()] = new_name
-    self.source_name_map[marker.source.name][marker.name] = new_name
+    self.source_name_map[marker.source.name][marker.source_name] = new_name

Bug fix part 2
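
To see why the lookup key matters, here is a toy reproduction of one way such a mismatch can arise. It is not the project's actual code: `ToyMarker` and `build_name_map` are hypothetical stand-ins, and all identifiers are made up. The only idea borrowed from the diff is that `source_name` preserves the name a marker had when it was created, while `name` may be rewritten later during identifier resolution.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToyMarker:
    name: str         # current name; may be rewritten during resolution
    source: str       # publication that supplied the definition
    source_name: str  # name as it appeared in the source; never rewritten


def build_name_map(markers, final_names, key_attr):
    """Map source -> {key: final name}, keyed on the given marker attribute."""
    name_map = defaultdict(dict)
    for marker, final in zip(markers, final_names):
        name_map[marker.source][getattr(marker, key_attr)] = final
    return name_map


# A marker published as "mhX-2" whose name was already rewritten to "mhX-1".
markers = [ToyMarker(name="mhX-1", source="StudyA", source_name="mhX-2")]
final_names = ["mhX-1.v2"]

by_name = build_name_map(markers, final_names, "name")
by_source_name = build_name_map(markers, final_names, "source_name")

# Frequency records still carry the identifier used by the source.
print(by_name["StudyA"].get("mhX-2"))         # None
print(by_source_name["StudyA"].get("mhX-2"))  # mhX-1.v2
```

Keying on the rewritten name leaves the frequency record's original identifier with no entry in the map, so it is never renamed; keying on the preserved source name resolves it.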

  - 2413 distinct loci
  [frequencies]
- - 59753 haplotypes
+ - 59704 haplotypes

Correcting for frequency records using deprecated marker identifiers

Comment on lines 130 to 135
def test_marker_names_valid():
    freq_markers = set(microhapdb.frequencies.Marker)
    markers = set(microhapdb.markers.Name)
    invalid = freq_markers - markers
    print(invalid)
    assert len(invalid) == 0

Added this regression test
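
The same check can be sketched without the database, assuming only that frequency records and marker records each expose a marker name. The data and the `orphaned_markers` helper below are made up for illustration.

```python
def orphaned_markers(freq_marker_names, marker_names):
    """Return names used in frequency records that match no known marker."""
    return sorted(set(freq_marker_names) - set(marker_names))


marker_names = ["mhX-1.v1", "mhX-1.v2"]
freq_marker_names = ["mhX-1.v1", "mhX-1.v1", "mhX-2"]

orphans = orphaned_markers(freq_marker_names, marker_names)
print(orphans)  # ['mhX-2']

# In a test, putting the offenders in the assertion message reports them on failure:
# assert not orphans, f"frequencies reference unknown markers: {orphans}"
```

Any frequency record whose identifier was not renamed along with its marker shows up in the set difference, which is how a check of this kind surfaces stale identifiers.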

standage commented Apr 24, 2025

Additional issues discovered after running the regression test on the master branch.

  • Three different allele definitions under the identifier mh01NK-001 (Staadig2021, Kidd2018|Turchi2019|Gandotra2020, Pakstis2021) were successfully merged into mh01NH-04 (Hiroaki2015). But weirdly all of the frequencies from these studies were renamed to mh01NH-01.v? instead of mh01NH-04.v?. This is resolved in this branch.
  • An allele definition under the identifier mh09KK-010 (Gandotra2020|Pakstis) was successfully merged into mh09USC-9pA, but the frequencies from Gandotra2020 were not renamed correctly. This branch corrects the issue.
  • Two allele definitions under the identifier mh22KK-340 (Gandotra2020|Nimagen2023, Pakstis2021) were successfully merged into mh22USC-22qB, but the frequencies from Gandotra2020 were not renamed correctly. This branch fixes the issue.

@standage standage merged commit ffcdaf3 into master Apr 24, 2025
4 checks passed
@standage standage deleted the docs branch April 24, 2025 19:12