Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Execution of Teloclip and input output streams #26

Open
ufaroooq opened this issue Aug 8, 2024 · 2 comments
Open

Execution of Teloclip and input output streams #26

ufaroooq opened this issue Aug 8, 2024 · 2 comments

Comments

@ufaroooq
Copy link

ufaroooq commented Aug 8, 2024

Dear Adam,

I hope you are doing well. I just got to know about this tool to find telomers in assembled genomes. I Am trying to run this on one of my assembled genomes but the documentation in readme file is confusing me.
I also checked a previously opened issue #20 by @cyycyj but stat also increased some confusions so I will try to put all confusions in one go so yu can help.
As from the Documentation I see 3 major steps to follow

Step1: First index the reference assembly

samtools faidx ref.fa

Step2: Streaming SAM records from aligner

minimap2 -ax map-ont ref.fa nanopore.fq | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai | samtools sort > teloclip_overhangs.bam

Step by step as mentioned by @cyycyj in #20

1. minimap2 -ax map-ont ref.fa pacbio.fq.gz > mapped.sam
2. samtools view -h -F 0x100 mapped.sam > FILTERED.SAM
3. teloclip --refIdx ref.fa.fai filtered.sam > teloclip_filtered.sam
5. samtools sort teloclip_filtered.sam -o TELOCLIP_FILTERED.BAM

Step 3: Report clipped alignments containing target motifs

samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | samtools sort > teloclip_TTAGGG_overhangs.bam

Step by step as mentioned by @cyycyj in #20

1. samtools view -h in.bam > header.sam    #convert bam to sam including header information
2. teloclip --ref ref.fa.fai --motifs TTAGGG header.sam > teloclip_TTAGGG_overhangs.sam
3. samtools sort teloclip_TTAGGG_overhangs.sam > teloclip_TTAGGG_overhangs.bam

QUESTION: This in.bam is causig confusion. as @cyycyj had asked already in #20

  • This in.bam will be the output from step 2 being the FILTERED.SAM (after sam2bam conversion) file ?
  • This in.bam will be the final output from step 2 being the TELOCLIP_FILTERED.BAM file ?

QUESTION regarding Matching noisy target motifs*
when using the --fuzzy option with teloclip, i get error teloclip: error: unrecognized arguments: --fuzzy

Step 4: Extract clipped reads

samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs

QUESTION: here the in.bam will be which of the following ?

  • This in.bam will be the output from step 2 being the FILTERED.SAM (after sam2bam conversion) file ?
  • This in.bam will be the final output from step 2 being the TELOCLIP_FILTERED.BAM file ?
  • This in.bam will be the final output from step 3 being the TELOCLIP_TTAGGG_OVERHANGS.BAM file ?

Major Confusions 1 in #20 you answered as

" the bam or sam that gets passed to teloclip should always be either raw aligments from an alignment tool like minimap2 OR alignments that have been filtered with samtools view to remove low quality alignments."

This is point of confusion, So in both step 2,3,4 the in.bam file should be the either the raw alignment file mapped.sam or filteres alignment file FILTERED.SAM which are generated in step2. can you please shed some light on this.

Major Confusions 2

all these steps should be executed to run teloclip ??

Step1: samtools faidx ref.fa
Step2: minimap2 -ax map-ont ref.fa nanopore.fq | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai | samtools sort > teloclip_overhangs.bam
Step3: samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | samtools sort > teloclip_TTAGGG_overhangs.bam
Step4: samtools view -h in.bam | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs

OR step 2 3 4 are just different ways to run teloclip and can be combined into one step as below ?

Step1: samtools faidx ref.fa
Step2: minimap2 -ax map-ont ref.fa nanopore.fq | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs

If you are able to help with the mwntioned questions it will be great help.
Best regards

@Adamtaranto
Copy link
Owner

In response to your last question, yes these are all just different ways to chain the steps together.

In your example:

# Index the draft genome assembly
samtools faidx ref.fa

# Map ONT reads to reference with minimap2
# Filter out secondary alignments with samtools
# Find reads that overhang contig ends AND contain TTAGGG motif with teloclip
# Write the overhang reads to fasta files with teloclip-extract
minimap2 -ax map-ont ref.fa nanopore.fq | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai --motifs TTAGGG | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs

This will create a dir called SplitOverhangs that has one fasta file of overhang reads for each of your contigs/chromosomes. The overhanging section of each read will be in lower case. You can manually select the best alignment (long high identity anchor on the contig end + longest run of telomeric repeats in the overhang) for each contig and append the overhang.

Chaining all the steps together like this means that you do not write any of the intermediate files to disk.

However, I generally recommend that you run the initial teloclip step with and without the --motifs filter and manually inspect the overhanging alignments in IGV.

You will see that there are different kinds of overhangs depending on how complete your assembly is.

  • Some overhangs may contain unassembled rDNA repeats that are near the end of a chromosome but before the telomeres.
  • Fragmented assemblies will have lots of non-telomeric overhangs where the contig has been broken at a repeat.
  • Mitochondrial contigs will always have LOTS of overhangs because they are circular and high copy. You should identify these and NOT extend them with overhanging reads.
  • Sometimes you will get chance single matches to the telomeric repeat, you should manually inspect the overhangs and decide if it looks real.
  • Sometimes it is not possible to identify telomeric repeats in very noisy ONT reads, in which case you should run teloclip without the --motifs option and manually select the overhangs that contain telomeres.

Re the --fuzzy option this is still in development and will replace the --noPoly option. Ignore it for now.

Please use the stable release that is on PyPi via pip install teloclip

@ufaroooq
Copy link
Author

ufaroooq commented Aug 9, 2024

Dear Adam,

Thank you for your detailed responce. It clerifies alot of things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants