Project at Harvard Medical School: Haplotype-aware de novo assembly of related individuals

Scientific question. Humans are diploid: there are two copies of each chromosome, one inherited from the mother and the other from the father. Determining the DNA sequences of these two chromosomal copies, called haplotypes, is important for many applications, ranging from population history to clinical questions. Existing sequencing technologies cannot read a chromosome from start to end; instead they deliver small pieces of sequence (called reads). As in a jigsaw puzzle, the underlying genome sequences are reconstructed from the reads by finding the overlaps between them. However, current standard approaches cannot produce the sequences of both haplotypes; they “collapse” them into a single consensus sequence. We develop algorithms to solve the genome assembly problem for diploids, that is, “to simultaneously solve two jigsaw puzzles with very similar yet different images”. Furthermore, we want to incorporate pedigree information into the underlying model to generate diploid assemblies for related individuals. We will apply this method to a trio from GIAB, two of whose individuals are from PGP. On the application side, the main question is how much read data is required for related individuals as opposed to a single individual.
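To make the jigsaw analogy concrete, here is a minimal toy sketch in Python of overlap-based merging, and of why two haplotypes make it ambiguous. It is an illustration only, not the project's algorithm: the function names `overlap` and `merge`, the `min_len` threshold, and the two short sequences are all made up for this example, and real reads are far longer and contain sequencing errors.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def merge(a: str, b: str, min_len: int = 3):
    """Join `a` and `b` along their overlap, or return None if they do not overlap."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None


# Two hypothetical haplotypes that differ at a single position (index 7).
hap1 = "ACGTTGCAAGCT"
hap2 = "ACGTTGCTAGCT"

# A read from the region both haplotypes share, and one read from each
# haplotype that covers the differing position.
left = hap1[:7]     # "ACGTTGC", identical in hap1 and hap2
right1 = hap1[4:]   # "TGCAAGCT"
right2 = hap2[4:]   # "TGCTAGCT"

# The shared read overlaps equally well with both continuations.
ov1 = overlap(left, right1)
ov2 = overlap(left, right2)

# Each merge reconstructs a different haplotype.
asm1 = merge(left, right1)
asm2 = merge(left, right2)
```

In this toy case the shared read `left` fits both `right1` and `right2` equally well, so overlaps alone cannot tell which continuation belongs to which chromosomal copy. A collapsing assembler would pick one base at the differing position; a diploid assembler has to keep both paths and assign reads to haplotypes.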

Harvard Medical School is looking for interns to work on this diploid assembly project. For further details, view the position description: Diploid genome assembly of related individuals
