Mapping with vg

Table of Contents

Even though it would still fit the topic of bacterial pangenomics, for the sake of clarity I’m starting a new post to test how vg performs when mapping sequencing reads to genome graphs. I’ve already done test runs with Pandora and I will use the same isolates for the first tests with vg as well.

Goals

  • map forward reads of CH3797
  • map forward reads of the 27 biofilm example isolates
  • compare to Pandora results
    • check gene presence/absence
    • variant calling
  • visualise the isolate paths in the graph

Mapping to the five references pangenome

As far as I know, vg (like the traditional mappers I know) can only map one sample at a time to the reference graph, so I’ll start with the single isolate I also used for Pandora before running the same command 27 times for the isolate set from the biofilm proteome paper.

Reference preparation

The simplest way to start mapping reads to a graph in vg is using vg map -d idxbase -f in1.fq [-f in2.fq] > aln.gam - notice how using paired end reads is possible here! Since I only copied the fastq files containing the first read pairs to my working directories, though, I’ll work with “single end” data for now.
I will try using the annotated version of my reference graph first:

cd /data3/genome_graphs/CPANG/playground/day3/references
vg index -g FivePsaeAnnotAll.gcsa -k 16 FivePsaeAnnotAll.vg
Found kmer with offset >= 1024. GCSA2 cannot handle nodes greater than 1024 bases long. To enable indexing, modify your graph using `vg mod -X 256 x.vg >y.vg`. CGATGCCCATGCCGCC   1130873:1024    A       G       1130873:1040

Ah yes, so far I’ve been very happy that I didn’t have to modify my graph, because I struggled to understand the implications at the course (although it sounded like there were none). I’ll just try doing what the error message says…

vg mod -X 256 FivePsaeAnnotAll.vg > FivePsaeAnnotAll_mod.vg
vg index -g FivePsaeAnnotAll.gcsa -k 16 FivePsaeAnnotAll_mod.vg
DiskIO::write(): Write failed
DiskIO::write(): You may have run out of temporary disk space at /tmp

OK, this is not getting any better, is it? Luckily, pruning graphs is also part of the day 3 instructions/hints. Following this, I will remove high-degree nodes (with many edges) and sort my graph before generating the xg index and then prune the graph for the GCSA index.

vg mod -X 32 FivePsaeAnnotAll.vg | vg mod -M 8 - | vg sort - > FivePsaeAnnotAll_mod.vg
vg index -x FivePsaeAnnotAll_mod.xg FivePsaeAnnotAll_mod.vg
vg prune -k 16 -e 3 FivePsaeAnnotAll_mod.vg > FivePsaeAnnotAll_prune.vg
vg index -g FivePsaeAnnotAll_mod.gcsa -k 16 FivePsaeAnnotAll_prune.vg

While this runs, here’s an overview of what’s happening: vg mod -X 32 chops the nodes so that they are at most 32 nucleotides long, then vg mod -M 8 unlinks nodes with more than eight edges to reduce complexity. The graph is then sorted and saved before the xg index is created with the next command.
vg prune -k 16 -e 3 prunes complex regions in the graph, using 16 as k-mer length and three as a maximum number of edges per k-mer. Pruning also removed embedded paths - should I not have used the annotated version of my reference graph? Is the annotation even still there?

vg paths -L -x FivePsaeAnnotAll_mod.xg
vg paths -L -v FivePsaeAnnotAll_prune.vg

vg paths -L lists all paths embedded in the graph. The modified graph still contains the annotations, but the pruned one does not. How will that influence the mapping results?

Mapping a single sample

OK, I have the indices ready - let’s map!

cd ../../vg/
vg map -d /data3/genome_graphs/CPANG/playground/day3/references/FivePsaeAnnotAll_mod -f /data3/genome_graphs/sample_data/CH3797_R1.fastq.gz > CH3797_R1.vs.FivePsaeAnnotAll.gam

There are multiple way to use these mapping results. On day two we calculated the mean read identity after mapping, which could probably be interesting here as well. I also tried augmenting my graph with the long reads and to create a pileup, but this wasn’t successful. Instead, I think I will try vg surject and vg vectorize to learn more about how the mapping went. Both methods were part of the day two lecture - vg surject creates BAM files from the mapping, while vg vectorize lists which reads touch which nodes in the graph. I wonder if one of these methods can be used to create a gene presence/absence list, or if vg pack (which calculates graph coverage) is a better choice for that.

I’ll start with the “simple” question of mean read identity.

vg view -a CH3797_R1.vs.FivePsaeAnnotAll.gam | jq .identity | awk '{i+=$1; n+=1} END {print i/n}'

The mean read identity is 0.886578, which is not as much as I had hoped for.

Before I go on to test other vg commands, I’ll work my way through some of the options to vg map. There is for example one to surject the output directly into the graph’s paths to create a BAM file, and an option to output a table with information about each read (name, chr, pos, mq, score). Since there is also an option to exclude reads with no alignment, I assume that unmapped reads will be part of this table (and possibly, somehow, the GAM file?).

vg map -d /data3/genome_graphs/CPANG/playground/day3/references/FivePsaeAnnotAll_mod -f /data3/genome_graphs/sample_data/CH3797_R1.fastq.gz --surject-to bam -v > CH3797_R1.vs.FivePsaeAnnotAll_sur_v.gam
mv CH3797_R1.vs.FivePsaeAnnotAll_sur_v.gam CH3797_R1.vs.FivePsaeAnnotAll.tab

Since the only output from this is the table with the reads, I assume I’d have to call the vg map command at least twice to get all the information I want - once for the GAM and once for this table. I can always run vg surject afterwards on the GAM, assuming that the result will be the same. So, how useful is this table?

It lists all reads (I think) by name, followed by either an empty cell or the name of the path that was hit (I used the annotated version for mapping, so there are genes as well as whole genomes included as paths). Then the mapping position (in relation to what?), the quality and the score of the mapping. There are a lot of empty “chromosome” entries, which is a little irritating considering that the graph is based on the five reference genomes and each node should be annotated with at least one of those. Or does this happen when there is more than one path? The mapping position is also weird, since it’s always (?) 0 when there is no chromosome assigned. Since there is still a mapping score and quality, I assume that some mapping took place nevertheless, I just don’t know where. How do I find unmapped reads, though?

vg surject -x /data3/genome_graphs/CPANG/playground/day3/references/FivePsaeAnnotAll_mod.xg -b CH3797_R1.vs.FivePsaeAnnotAll.gam > CH3797_R1.vs.FivePsaeAnnotAll.bam
Segmentation fault (core dumped)

Despite the error, a BAM file was created… Maybe it’s not complete. Let’s see if I can extract unmapped reads from it.

samtools view -f 4 CH3797_R1.vs.FivePsaeAnnotAll.bam > unmapped.sam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read] Read block operation failed with error 4 after 2 of 4 bytes
[E::bam_hdr_read] Error reading BGZF stream
[main_samview] fail to read the header from "CH3797_R1.vs.FivePsaeAnnotAll.bam".

No SAM file was created, and it indeed looks like the BAM file was truncated. Is this a simple question of memory, or another problem? Can I solve this with using more threads?

vg surject -x /data3/genome_graphs/CPANG/playground/day3/references/FivePsaeAnnotAll_mod.xg -t 5 -b CH3797_R1.vs.FivePsaeAnnotAll.gam > CH3797_R1.vs.FivePsaeAnnotAll.bam

That also stops working after a few minutes. I can’t really believe it’s the memory, though, since I included the surject argument in vg map before and that worked. Are the two ways to generate that BAM file so different?
Anyway, let me try that again…

vg map -d /data3/genome_graphs/CPANG/playground/day3/references/FivePsaeAnnotAll_mod -f /data3/genome_graphs/sample_data/CH3797_R1.fastq.gz --surject-to bam > CH3797_R1.vs.FivePsaeAnnotAll.bam
Segmentation fault (core dumped)

Ah, I guess the -v overwrote the --surject-to. I wonder if the paths I have included in the graph (all the annotations!) contribute to the memory problem. I could just as well map to the non-annotated graph, as long as I get the information to which nodes the reads mapped out afterwards.

Reference preparation and mapping without annotation

To keep things comparable, I’ll use the same commands I used for the annotated graph to index and prepare the simpler graph as well.

cd /data3/genome_graphs/CPANG/playground/day3/references
gunzip FivePsae.vg.gz
vg mod -X 32 FivePsae.vg | vg mod -M 8 - | vg sort - > FivePsae_mod.vg &
vg index -x FivePsae_mod.xg FivePsae_mod.vg &
vg prune -k 16 -e 3 FivePsae_mod.vg > FivePsae_prune.vg &
vg index -g FivePsae_mod.gcsa -k 16 FivePsae_prune.vg &

Time for a quick size comparison of the created files (with and without annotation): the modified graph with annotation is 46M big, without annotation it’s just 31M. The indices are even bigger: 219M versus 153M with and without annotation, respectively. The pruned graph is 11M in both versions, since the paths were removed anyway. The same is true for the 40M GCSA indices.

cd ../../vg/
vg map -d /data3/genome_graphs/CPANG/playground/day3/references/FivePsae_mod -f /data3/genome_graphs/sample_data/CH3797_R1.fastq.gz > CH3797_R1.vs.FivePsae.gam

The command never gave me back my console, only after I hit Enter, and there were no status, warning or error messages whatsoever… A GAM file was generated, though, so let’s check that out by surjecting it to BAM format.

vg surject -x /data3/genome_graphs/CPANG/playground/day3/references/FivePsae_mod.xg -b CH3797_R1.vs.FivePsae.gam > CH3797_R1.vs.FivePsae.bam
error:[Surjector] could not identify path position of surjected alignment M01340:25:000000000-ABNVH:1:1101:22664:18005

Ah, a new error message! What does this one mean? Is the generated BAM file truncated again? Well, most likely, since it is very small…

samtools flagstat CH3797_R1.vs.FivePsae.bam

EOF marker is absent and no reads are included in the file, so definitely broken. The read mentioned in the error message should have mapped to the graph, at least it is listed in CH3797_R1.vs.FivePsaeAnnotAll.tab (the tab separated list of reads I generated for the annotated graph). It is one of the entries with a 0 for mapping position and I really hope that that is not the problem. Maybe there was something wrong with the mapping? After all, it was a bit strange…

I can’t find anything about this kind of error on GitHub or Biostars, but I’ll try updating vg to the newest version just in case there was a bug that has been fixed (I still have vg 1.19, while the latest version if vg 1.22).

vg map -d /data3/genome_graphs/CPANG/playground/day3/references/FivePsae_mod -f /data3/genome_graphs/sample_data/CH3797_R1.fastq.gz > CH3797_R1.vs.FivePsae.gam
warning [libhandlegraph]: Serialized handle graph does not appear to match deserialzation type.
warning [libhandlegraph]: It is either an old version or in the wrong format.
warning [libhandlegraph]: Attempting to load it anyway. Future releases will reject it!
warning:[XG] Loading an out-of-date XG format.For better performance over repeated loads, consider recreating this XG index.

While the process did seem to run, I prefer to re-create the indices as well, just to be sure.

cd /data3/genome_graphs/CPANG/playground/day3/references
rm FivePsae_mod.*
rm FivePsae_prune.vg
vg mod -X 32 FivePsae.vg | vg mod -M 8 - | vg sort - > FivePsae_mod.vg
gzip FivePsae.vg
terminate called after throwing an instance of 'std::runtime_error'
  what():  [io::ProtobufIterator] tag "HashGraph" for Protobuf that should be "VG"
ERROR: Signal 6 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_2xOBkY/stacktrace.txt
Please include the stack trace file in your bug report!

OK, since I don’t know what that error means or how good the backwards compatibility is between the data I have and the new vg version, I’ll create a new graph to start again. It’s hopefully not necessary, but it takes time to get a reply on Biostars and I don’t want to wait to be told that this is what I have to do.

Re-creation of the five reference graph

Since I created the graph with minimap2 and seqwish, the only vg-related step I have to repeat is the conversion from GFA to vg format with vg view:

gunzip FivePsae.gfa.gz
vg view -F FivePsae.gfa -v > FivePsae_new.vg
gzip FivePsae.gfa

That seems to have gone well, there was no feedback. The new file has the same size as the old file, let’s see how the modifications work now, one step at a time.

vg mod -X 32 FivePsae_new.vg > FivePsae_32.vg
vg mod -M 8 FivePsae_32.vg > FivePsae_32_8.vg
vg sort FivePsae_32_8.vg > FivePsae_mod_new.vg
terminate called after throwing an instance of 'std::runtime_error'
  what():  [io::ProtobufIterator] tag "HashGraph" for Protobuf that should be "VG"
ERROR: Signal 6 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_2xOBkY/stacktrace.txt
Please include the stack trace file in your bug report!

It’s the sorting that doesn’t work, it throws the error right away. The reason for that is the move to a new output format (HashGraph) which renders sorting the graph unnecessary. I don’t know if the switch to HashGraphs also means I should keep the new version of my graph, or if the old one is enough, so I’m sticking to the new one for now.

rm FivePsae_32.vg FivePsae_32_8.vg FivePsae_mod_new.vg
vg mod -X 32 FivePsae_new.vg | vg mod -M 8 - > FivePsae_mod_new.vg
vg index -x FivePsae_mod_new.xg FivePsae_mod_new.vg
vg: /data3/genome_graphs/vg-v1.22.0/include/sdsl/select_support_mcl.hpp:349: sdsl::select_support::size_type sdsl::select_support_mcl<t_bit_pattern, t_pattern_len>::select(sdsl::select_support::size_type) const [with unsigned char t_b = 1u; unsigned char t_pat_len = 1u; sdsl::select_support::size_type = long unsigned int]: Assertion `i > 0 and i <= m_arg_cnt' failed.
ERROR: Signal 6 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_2jk34E/stacktrace.txt
Please include the stack trace file in your bug report!

Well, that didn’t last long… I do have to create an index, though, so this is not due to some unnecessary call. The options for vg index don’t seem to have changed, so what could be causing this? I can’t find anything very helpful in the release documentation either, so maybe I should say goodbye to the course materials and use the commands from the vg Wiki instead. In the basic operations document from 2018, they use the graph as is for the xg index and only prune for the gcsa.

vg index -x FivePsae_new.xg FivePsae_new.vg
vg prune -r FivePsae_new.vg > FivePsae_prune_new.vg
vg index -g FivePsae_new.gcsa -k 16 FivePsae_prune_new.vg
Found kmer with offset >= 1024. GCSA2 cannot handle nodes greater than 1024 bases long. To enable indexing, modify your graph using `vg mod -X 256 x.vg >y.vg`. GGGCGAGGCGTTGAAC     1062065:1024    G   G1062065:1040

The xg index was created without problems, but GCSA is struggling with the command copied from the Wiki. I’ll have to modify again.

vg mod -X 256 FivePsae_new.vg > FivePsae_mod_new.vg
vg prune -r FivePsae_mod_new.vg > FivePsae_prune_new.vg
vg index -g FivePsae_new.gcsa -k 16 FivePsae_prune_new.vg
DiskIO::write(): Write failed
DiskIO::write(): You may have run out of temporary disk space at /tmp

Well, this was obviously not enough modification or pruning. This also happened when I first tried preparing the graph for mapping. Maybe I should try the modifications I used before, but only for the gcsa index?

vg mod -X 32 FivePsae_new.vg | vg mod -M 8 - > FivePsae_mod_new.vg
vg prune -r FivePsae_mod_new.vg > FivePsae_prune_new.vg
vg: /data3/genome_graphs/vg-v1.22.0/include/sdsl/select_support_mcl.hpp:349: sdsl::select_support::size_type sdsl::select_support_mcl<t_bit_pattern, t_pattern_len>::select(sdsl::select_support::size_type) const [with unsigned char t_b = 1u; unsigned char t_pat_len = 1u; sdsl::select_support::size_type = long unsigned int]: Assertion `i > 0 and i <= m_arg_cnt' failed.
ERROR: Signal 6 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_HY35mt/stacktrace.txt
Please include the stack trace file in your bug report!

Nope, it looks like neither the indexing nor the pruning can work with this kind of modified graph. How broken is that graph, and why?

vg stats -z FivePsae_mod_new.vg
nodes   1265264
edges   1644925

This seems to work fine, so hopefully the graph itself isn’t broken. I’ll have to wait for a solution - for now I’m giving my graph to the developers to have a look.

tar -czvf FivePsae_new.vg.tar.gz FivePsae_new.vg

And I’ll see if the same pipeline works for one of the day 1 to make sure it’s the graph and not a problem with the new installation.

cd ../../day1/sim_reads
mkdir vg_1.22_test
vg construct -r /data3/genome_graphs/vg-v1.22.0/test/1mb1kgp/z.fa -v /data3/genome_graphs/vg-v1.22.0/test/1mb1kgp/z.vcf.gz > vg_1.22_test/z.vg
warning:[vg::Constructor] Unsupported variant allele "<CN0>"; Skipping variant(s) z     13790   BI_GS_DEL1_B1_P2734_15 T       <CN0>   100     PASS    AC=1;AF=0.000199681;AFR_AF=0;AMR_AF=0;AN=5008;CIEND=-75,76;CIPOS=-75,76;CS=DEL_union;DP=15541;EAS_AF=0;END=1017354;EUR_AF=0.001;NS=2504;SAS_AF=0;SVLEN=-3562;SVTYPE=DEL !
warning:[vg::Constructor] Unsupported variant allele "<INS:ME:ALU>"; Skipping variant(s) z      14169 ALU_umary_ALU_12010      G       <INS:ME:ALU>    100     PASS    AC=43;AF=0.00858626;AFR_AF=0.0303;AMR_AF=0.0043;AN=5008;CS=ALU_umary;DP=18053;EAS_AF=0;EUR_AF=0;MEINFO=AluUndef,2,281,+;NS=2504;SAS_AF=0;SVLEN=279;SVTYPE=ALU;TSD=null !
warning:[vg::Constructor] Unsupported variant allele "<CN2>"; Skipping variant(s) z     490168  DUP_gs_CNV_20_1490168_1549769  G       <CN2>   100     PASS    AC=1;AF=0.000199681;AFR_AF=0;AMR_AF=0;AN=5008;CS=DUP_gs;DP=21874;EAS_AF=0;END=1549769;EUR_AF=0;NS=2504;SAS_AF=0.001;SVTYPE=DUP !

Well, at least the old warning messages are still there. Would my indexing pipeline work with this graph?

cd /vg_1.22_test
vg mod -X 32 z.vg | vg mod -M 8 - > z_mod.vg
vg index -x z_mod.xg z_mod.vg
vg prune -k 16 -e 3 z_mod.vg > z_prune.vg
vg index -g z_mod.gcsa -k 16 z_prune.vg

No problems whatsoever, so it is related to the graph and not the installation.

The problem is that vg mod -M 8 removes nodes and edges, but not the paths that pass through them, so we end up with empty nodes and vg can’t work with that (anymore) - it sounds a little like the “invalid edge” warnings I get when extracting sub-graphs. The solution would be either to remove all paths, or to see if chopping nodes with vg mod -X 32 and then pruning is enough to create the gcsa index. I’d prefer the latter, but since it’s possible to extract all nodes touched by a path, I could also live with the first possibility.

cd /data3/genome_graphs/CPANG/playground/day3/references
vg mod -X 32 FivePsae_new.vg > FivePsae_mod_new.vg
vg index -x FivePsae_mod_new.xg FivePsae_mod_new.vg
vg prune -k 16 -e 3 FivePsae_mod_new.vg > FivePsae_prune_new.vg
vg index -g FivePsae_mod_new.gcsa -k 16 FivePsae_prune_new.vg

No error or warning messages this time! Let’s see how the mapping goes.

cd ../../vg/
vg map -d /data3/genome_graphs/CPANG/playground/day3/references/FivePsae_mod_new -f /data3/genome_graphs/sample_data/CH3797_R1.fastq.gz > CH3797_R1.vs.FivePsae.gam

This looks good, the resulting gam file is about as big as the first one I got using the annotated graph (before all the problems started). Let’s see if I can get a bam file out of it.

vg surject -x /data3/genome_graphs/CPANG/playground/day3/references/FivePsae_mod_new.xg -b CH3797_R1.vs.FivePsae.gam > CH3797_R1.vs.FivePsae.bam
terminate called after throwing an instance of 'std::runtime_error'
  what():  error:[Surjector] could not identify path position of surjected alignment M01340:25:000000000-ABNVH:1:1105:9803:6739
ERROR: Signal 6 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_bcZ5pb/stacktrace.txt
Please include the stack trace file in your bug report!

That error message came almost instantly, but the creation of the bam file had already started. This is also the same error message (though for a different alignment) that I got before from vg surject - which made me update vg and all. So no change to the “original” problem.

Oh well, I can still compare the read identity to the one I had with the older version of vg (should be the same, right?), and generate the list of reads and how they mapped.

vg view -a CH3797_R1.vs.FivePsae.gam | jq .identity | awk '{i+=$1; n+=1} END {print i/n}'
vg map -d /data3/genome_graphs/CPANG/playground/day3/references/FivePsae_mod_new -f /data3/genome_graphs/sample_data/CH3797_R1.fastq.gz -v > CH3797_R1.vs.FivePsae.tab

The mean read identity actually improved a tiny little bit, from 0.886578 to 0.889595, I guess the mapping algorithm got better between versions. The list of reads with their mapping statistics contains 734237 reads, same as before. Doing a simple diff on both files is no use, apparently the order of reads is different or something. I’ll have a look at them in R.

These “refpos” tables still confuse me. The reads are the same, but why do so many reads not have a “chromosome” (path?) assigned? And why are mapping quality and score not always the same? The sequence is the same, only the annotation is different in the reference, right? Let’s ask odgi for some stats on that…

conda activate python3.7
cd ../day3/references/new_odgi
odgi stats -i FivePsae.og -S
length: 8545811
nodes:  1104940
edges:  1489544
paths:  5

It doesn’t work with the original FivePsae.og, since I updated odgi and it looks like they also switched to the new graph format there, so I had to use the newly created one.
The annotated graph will need to be converted to GFA first:

cd ..
vg view -g FivePsaeAnnotAll.vg > FivePsaeAnnotAll.gfa
odgi build -g FivePsaeAnnotAll.gfa -o - | odgi sort -i - -o FivePsaeAnnotAll.og
odgi stats -i FivePsaeAnnotAll.og -S
length: 8545811
nodes:  1128326
edges:  1512930
paths:  30578

Right, so the different annotation forces a change in number of nodes and edges - it looks like the sequence is being split up further. Maybe that influences the mapping? If it does, it does so in both directions - sometimes mapping quality is better in the original graph, sometimes it’s better in the extensively annotated one. Overall, this is a rare occurrence, though - most reads show no difference at all (98.9%).

Also interesting is that the differences in mapping quality and score do not appear in the same reads. For the score, 98.5% of the reads have identical values, and only 27 reads (0.003%) have a changed score and mapping quality. All other changes only happen in either the quality or the score, but not in both.

While I don’t understand the cause for these differences (it could also just be the updated version of vg?), I assume that they are the cause for the slight difference in mean read identity I calculated above.

Unmapped reads

As long as I can’t create a BAM file to work with, I’ll have to make do with the refpos tables and the GAM files. Maybe the tables are enough to have a look at unmapped reads? As far as I understood, they should be part of the tables and the GAM, since I didn’t use the --exclude-unaligned option, but I’m not sure how to identify them in the tables, and I can’t read the GAM directly. The tables have reads with 0 mapping quality, but in standard mappers that means the read mapped to multiple locations, not necessarily that it didn’t map at all, and I’m not sure what the meaning of the scores is. The materials for day 3 state this in the hints:

“You may want to filter alignments to only keep those with a positive score (that are mapped).”

So that means only reads with a positive score are mapped? That would mean that 97.5% of reads mapped to the graph(s), as there is only a two read difference between the two versions when looking at reads with scores bigger than 0. Including 0 in the count of successfully mapped reads leads to 99.9% mapped reads. That would be very good news indeed.

So how does this compare to other mapping methods? Did the unmapped reads also not map to other references? Sadly, Pandora does not return a list of reads (yet), so I can only compare with the original mapping to PA14.

samtools view -f 4 CH3797.bam > /data3/genome_graphs/CPANG/playground/vg/CH3797_PA14_unmapped.sam
cut -f1 CH3797_PA14_unmapped.sam | sort | uniq > CH3797_PA14_unmapped.txt

There are 85457 unmapped reads in the original BAM file, and only 629 I could identify here, but of course that was paired-end mapping and I don’t have the mates included here. Comparing the read names results in absolutely no overlap, though. Apparently, the 629 reads that didn’t map to my graph did map to the PA14 reference with stampy. How well did they map, then?

samtools view CH3797.bam | grep -F -w -f /data3/genome_graphs/CPANG/playground/vg/CH3797_R1.vs.FivePsae.unmapped.txt

This was supposed to extract the reads from the original BAM file, but there was no output, because the read name list I generated with R contained a space after every read. I corrected that manually.

Scrolling through the reads, most of them seem to actually be unmapped in the original mapping as well. Wait, is the space part of the name in my dataframe in R? Yes, yes it is. Apparently the file isn’t actually tab separated, but space separated and there’s one too much in my read name column… Luckily this is easy to resolve with the strip.white=TRUE option in read.table(). So now I have an overlap of 365 reads that didn’t map to my graph and neither to PA14 with stampy.

Nevertheless, that means I have 264 reads that mapped to the PA14 reference, but not to my graph. This has to be due to quality issues.

samtools view CH3797.bam | grep -F -w -f /data3/genome_graphs/CPANG/playground/vg/unmapped.mappedOrigin.txt > /data3/genome_graphs/CPANG/playground/vg/CH3797_graph_unmapped.sam

I took the sequence of the last read in the SAM file and blasted it against the Pseudomonas aeruginosa group, resulting in 100% identity hits in a lot of strains. For PA14, the matching region is 6460393 to 6460692, which I can find and analyse in IVG. This time, I want to add the reads to the graphical view:

/data3/genome_graphs/sequenceTubeMap/scripts/prepare_gam.sh CH3797_R1.vs.FivePsae.gam
Generating index from gam file

To do that, I have to index my GAM file and make sure that I use the right xg index for visualisation (“FivePsae_mod_new.xg”). The IVG directory comes with a “scripts” folder that contains a bash script to generate my GAM index file for me, but basically all it does is run vg gamsort for me. In order to be able to view the file, I have to copy it to the directory where IVG is looking for data.

cp CH3797_R1.vs.FivePsae.sorted.gam.gai CH3797_R1.vs.FivePsae.sorted.gam ../day3/references/

Now it takes a little longer to load the graph, but this is the result:

Reads mapping to a region in the PA14 reference genome Reads mapping to a region in the PA14 reference genome (click to enlarge)

Forward reads are coloured in red here, reverse are in blue. The green path that moves in and out is reference strain PA7, and the isolate I’m looking at obviously does not share these variants. On the other hand, I can see a clear variation that is not part of the references around position 6460450.

The weird thing about this is that the sequence shown in the nodes is not the sequence of the read that I was blasting, so I don’t know what I’m looking at. In the original mapping to PA14, this read mapped to the reverse strand, but even with the reverse complement, I can’t find the sequence in that region of the graph. I think the problem is that the BLAST uses a different PA14 genome sequence than I used (the NCBI version is 6,536,590 nucleotides long, but my annotation says my genome is 6,537,648 nucleotides long), so the genomic range is more or less useless. Except that it isn’t when I add the 1058 nucleotides difference to the original position. At 6461451 I can find the sequence of my read just fine:

Reads mapping to another region in the PA14 reference genome Reads mapping to another region in the PA14 reference genome (click to enlarge)

So why didn’t it map to this region, then? There are enough reads there, and it even supports a SNP exchanging a C for a G at a later position (not shown in the above screenshot) that is not present in any of the references. The reads that are shown here, did they all map correctly? The mapping score of “my” read was -78, but the mapping quality was 60 (the highest I can filter for in IVG)… I wish they would show the read names in IVG!

Well, I assume that “my” read is not shown, since it has two SNPs compared to PA14, and while one is shared by all other reads present, the other one is not even shown in the visualisation - so the read is not shown and I still don’t know why it didn’t map. (I do, of course - the score was too low, so the real question is about the score calculation.)

t.b.c.

Gene presence/absence

Identification of gene presence and absence is a little more tricky here compared to Pandora. The first thing I can think of to do is using vg vectorize/odgi bin to see which reads map to which nodes, maybe generalise that to a general set of nodes that were hit during the mapping, and compare that to nodes which are known to belong to gene paths. Alternatively, it could be possible to look at specific genes of interest by extracting subgraphs, augmenting them with reads, and then visualising that.

While I think odgi bin might be the better option, I’ll try with vg vectorize first, since that means no conversions of data.

cd /data3/genome_graphs/CPANG/playground/vg/
vg vectorize -x ../day3/references/FivePsae_mod_new.xg CH3797_R1.vs.FivePsae.gam > CH3797_R1.vs.FivePsae.vector

This takes some time, the output is huge and, somehow, not what I had expected after the example. The size is to be expected, with more than a million nodes, listing them all as columns in a table is quite something, and then add 730k something reads as rows. The thing is, they are not actually columns, just long lines of numbers. And there are no headers or row names - completely different from the example in the course slides. With this, I can’t figure out which line belongs to which read (if I wanted to cross-check the unmapped reads), but it should be enough to create a map of touched nodes to find the path of the sample through the graph.
Nevertheless, I stopped the process after about two hours, when the file was 136 GB in size, with 115,385 lines (reads). This is definitely too much to do for the whole thing.

For odgi bin (which I expect will be faster than vg vectorize, but I don’t know what the output will look like), I have to convert my GAM file to a file that odgi can read. I think I have to start with vg augment to embed the reads in the graph, then I should be able to convert that to GFA with vg view. Augmenting is also important for variant calling, so I will need that anyway. I’m just wondering if that will create one path for the sample, or one for each read…

vg augment /data3/genome_graphs/CPANG/playground/day3/references/FivePsae_mod_new.vg -i CH3797_R1.vs.FivePsae.gam > CH3797_R1.vs.FivePsae.vg
vg view -g CH3797_R1.vs.FivePsae.vg > CH3797_R1.vs.FivePsae.gfa
conda activate python3.7
odgi build -g CH3797_R1.vs.FivePsae.gfa -o - | odgi sort -i - -o CH3797_R1.vs.FivePsae.og
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_M_construct null not valid
terminate called after throwing an instance of 'std::runtime_error'
  what():  error: Serialized handle graph does not match deserialzation type.
Aborted (core dumped)

Apparently that is not a good strategy, since odgi doesn’t want to work with the converted graph. I had planned to check the number of paths with odgi as a next step, but it looks like I’d better use vg to see what’s inside the augmented graph.

vg paths -L -v CH3797_R1.vs.FivePsae.vg

Note: Never forget to specify the input format for vg paths, there will be no warning and no output otherwise.
Sadly, vg paths gives me a long list of paths - I now have one for each read. Maybe this is good for variant calling, but I would be very happy with a consensus-like merged path for the sample, especially for visualisation.

Variant calling

Since I’ve already augmented the reference graph with my reads while looking for gene presence/absence information, I can now do the variant calling without any further preparation steps.

vg call -s CH3797 CH3797_R1.vs.FivePsae.vg > CH3797_R1.vs.FivePsae.vcf
error [vg call]: pack file (-k) is required

Interesting. The Wiki is obviously very old, as it doesn’t mention this requirement (and it contains options for vg call that don’t exist anymore), so I didn’t see this coming… The help makes it sound like I need to supply a pack file, which I would have to create first. vg pack creates “graph coverage vectors”, which might actually be what I was looking for earlier - an indication which nodes were hit by my sample.

vg pack -x /data3/genome_graphs/CPANG/playground/day3/references/FivePsae_mod_new.xg -g CH3797_R1.vs.FivePsae.gam -o CH3797_R1.vs.FivePsae.pack
vg call -k CH3797_R1.vs.FivePsae.pack -s CH3797 CH3797_R1.vs.FivePsae.vg > CH3797_R1.vs.FivePsae.vcf
vg: /data3/genome_graphs/vg-v1.22.0/include/sdsl/vlc_vector.hpp:172: sdsl::vlc_vector<t_coder, t_dens, t_width>::value_type sdsl::vlc_vector<t_coder, t_dens, t_width>::operator[](sdsl::vlc_vector<t_coder, t_dens, t_width>::size_type) const [with t_coder = sdsl::coder::elias_delta; unsigned int t_dens = 128u; unsigned char t_width = 0u; sdsl::vlc_vector<t_coder, t_dens, t_width>::value_type = long unsigned int; sdsl::vlc_vector<t_coder, t_dens, t_width>::size_type = long unsigned int]: Assertion `i < m_size' failed.
ERROR: Signal 6 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_2kUxbr/stacktrace.txt
Please include the stack trace file in your bug report!

vg pack is actually very fast, it only takes a few seconds and returns a compressed file of “coverage packs”, but vg call still refuses to work. I’m starting to feel that vg has problems with my data, even though I can’t understand why.

I’ll go back to vg pack and have a look at other options, maybe there’s something there that could help with gene presence/absence. There is, for example, -d which writes a table representing packs to stdout.

vg pack -x /data3/genome_graphs/CPANG/playground/day3/references/FivePsae_mod_new.xg -g CH3797_R1.vs.FivePsae.gam -o CH3797_R1.vs.FivePsae.pack -d > CH3797_R1.vs.FivePsae.tsv

This is great! Judging from the header (it has a header! / “seq.pos - node.id - node.offset - coverage”) I believe this is a list of sequences (don’t know which kind) with the nodes they were found in, and their coverage. From this, I can calculate node coverage and compare that with the gene paths.

First things first: “sequences” are single nucleotides. According to odgi stats, my graph is 8,545,811 nucleotides long, and that’s also the number of rows in the table generated by vg pack.
If I group the data by node ID in R, I get 1,265,264 rows, which means I have more nodes now than were in my reference graph; I assume the mapping split some up even further.
There are 458,419 nodes with no coverage, containing 2,448,700 nucleotides (28.6% of the graph). In addition, there are 17,560 nodes which are only partly covered. These confuse me a little, as it often seems to be just one of multiple nucleotides that is not covered. In fact, 12,198 of the nodes (69.5%) only have one nucleotide that is not covered, followed by 1442 with two, 728 with three etc. The maximum number of not covered nucleotides is 31 in 58 nodes.

Afterthought: those 12k single nucleotides with no coverage are single variants that are not part of any of the five references. All I need is the information which nucleotide is represented most in the reads if it’s not the one(s) the references have.

Back to gene presence/absence

Great, so this is my way into the gene presence/absence question. I know exactly which nodes are covered with how many reads for each nucleotide position. What I need now is the information which node belongs to which gene. This could possibly be a jq question - I could use the JSON formatted tree to extract all node IDs for all paths. I mean, I did that before for GAM files of sub-graphs, why shouldn’t it work on the whole graph?

On the other hand, the data would probably be more usable in a database format, so I can easily query both ways - to which gene(s) does this node belong, and which nodes does this gene contain.

In any case, I’ll start with converting my graph to a pretty JSON format using vg view piped to jq (for the pretty print). Then I can have a look at the general structure to figure out how to move from there.

vg view FivePsaeAnnotAll.vg -j | jq '.' > FivePsaeAnnotAll.json
jq '{name: .name, nodes: ([.path[].mapping[].position.node_id | tostring] | join(","))}' FivePsaeAnnotAll.json
jq --stream '.[]' FivePsaeAnnotAll.json > test.json

note: Try python ijson

conda activate python3.7
sudo -i
conda install -c conda-forge ijson
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /usr/bin/miniconda3/envs/python3.7

  added / updated specs:
    - ijson


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2020.4.5.1         |   py37hc8dfbb8_0         151 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         151 KB

The following NEW packages will be INSTALLED:

  ijson              conda-forge/noarch::ijson-3.0-pyh9f0ad1d_0

The following packages will be UPDATED:

  ca-certificates                     2019.11.28-hecc5488_0 --> 2020.4.5.1-hecc5488_0
  certifi                         2019.11.28-py37hc8dfbb8_1 --> 2020.4.5.1-py37hc8dfbb8_0
  openssl                                 1.1.1e-h516909a_0 --> 1.1.1g-h516909a_0


Proceed ([y]/n)? y


Downloading and Extracting Packages
certifi-2020.4.5.1   | 151 KB    | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
exit
touch json2sql.py

Back to variant calling

Looking through the course materials, I finally figured out what the problem was with my vg call call: I used the augmented graph instead of the original (i.e. the index I used for mapping).

vg call /data3/genome_graphs/CPANG/playground/day3/references/FivePsae_mod_new.xg -k CH3797_R1.vs.FivePsae.pack -s CH3797 > CH3797_R1.vs.FivePsae.vcf

This results in a new set of error messages and an empty vcf file, so it’s not much more helpful, but it’s at least a reason why it didn’t work before.

error[RepresentativeTraversalFinder]: Node 428266 is on backbone path at 19 but not traced in site {"directed_acyclic_net_graph": true, "end": {"backward": true, "node_id": "428265"}, "parent": {"end": {"backward": true, "node_id": "428264"}, "start": {"backward": true, "node_id": "428270"}}, "start": {"backward": true, "node_id": "428267"}, "start_end_reachable": true, "type": 1}
error[RepresentativeTraversalFinder]: This can happen when the path you are calling against traverses the same part of your graph twice.
error[RepresentativeTraversalFinder]: Node 377277 is on backbone path at 7 but not traced in site {"directed_acyclic_net_graph": true, "end": {"backward": true, "node_id": "377276"}, "parent": {"end": {"backward": true, "node_id": "377275"}, "start": {"backward": true, "node_id": "377283"}}, "start": {"backward": true, "node_id": "377278"}, "start_end_reachable": true, "type": 1}
error[RepresentativeTraversalFinder]: This can happen when the path you are calling against traverses the same part of your graph twice.
error[RepresentativeTraversalFinder]: Node 406498 is on backbone path at 15 but not traced in site {"directed_acyclic_net_graph": true, "end": {"backward": true, "node_id": "406497"}, "parent": {"end": {"node_id": "406501"}, "start": {"node_id": "406495"}}, "start": {"backward": true, "node_id": "406499"}, "start_end_reachable": true, "type": 1}
error[RepresentativeTraversalFinder]: This can happen when the path you are calling against traverses the same part of your graph twice.
error[RepresentativeTraversalFinder]: Node 149338 is on backbone path at 24 but not traced in site {"directed_acyclic_net_graph": true, "end": {"backward": true, "node_id": "149337"}, "parent": {"end": {"node_id": "149341"}, "start": {"node_id": "149336"}}, "start": {"backward": true, "node_id": "149339"}, "start_end_reachable": true, "type": 1}
error[RepresentativeTraversalFinder]: This can happen when the path you are calling against traverses the same part of your graph twice.
error[RepresentativeTraversalFinder]: Node 486228 is on backbone path at 7 but not traced in site {"directed_acyclic_net_graph": true, "end": {"backward": true, "node_id": "486227"}, "parent": {"end": {"node_id": "486232"}, "start": {"node_id": "486225"}}, "start": {"backward": true, "node_id": "486229"}, "start_end_reachable": true, "type": 1}
error[RepresentativeTraversalFinder]: This can happen when the path you are calling against traverses the same part of your graph twice.
error[RepresentativeTraversalFinder]: Node 336408 is on backbone path at 33 but not traced in site {"directed_acyclic_net_graph": true, "end": {"backward": true, "node_id": "336407"}, "parent": {"end": {"backward": true, "node_id": "1265562"}, "start": {"backward": true, "node_id": "1276121"}}, "start": {"backward": true, "node_id": "336409"}, "start_end_reachable": true, "type": 1}
error[RepresentativeTraversalFinder]: This can happen when the path you are calling against traverses the same part of your graph twice.
error[RepresentativeTraversalFinder]: Node 273287 is on backbone path at 25 but not traced in site {"directed_acyclic_net_graph": true, "end": {"backward": true, "node_id": "273286"}, "parent": {"end": {"node_id": "273291"}, "start": {"node_id": "273285"}}, "start": {"backward": true, "node_id": "273288"}, "start_end_reachable": true, "type": 1}
error[RepresentativeTraversalFinder]: This can happen when the path you are calling against traverses the same part of your graph twice.
error[RepresentativeTraversalFinder]: Node 103025 is on backbone path at 11 but not traced in site {"directed_acyclic_net_graph": true, "end": {"backward": true, "node_id": "103024"}, "parent": {"end": {"node_id": "103029"}, "start": {"node_id": "103023"}}, "start": {"backward": true, "node_id": "103026"}, "start_end_reachable": true, "type": 1}
error[RepresentativeTraversalFinder]: This can happen when the path you are calling against traverses the same part of your graph twice.
error[RepresentativeTraversalFinder]: Node 122241 is on backbone path at 7 but not traced in site {"directed_acyclic_net_graph": true, "end": {"backward": true, "node_id": "122240"}, "parent": {"end": {"node_id": "122246"}, "start": {"node_id": "1259904"}}, "start": {"backward": true, "node_id": "122242"}, "start_end_reachable": true, "type": 1}
error[RepresentativeTraversalFinder]: This can happen when the path you are calling against traverses the same part of your graph twice.
error[RepresentativeTraversalFinder]: Node 264508 is on backbone path at 15 but not traced in site {"directed_acyclic_net_graph": true, "end": {"backward": true, "node_id": "264507"}, "parent": {"end": {"backward": true, "node_id": "264506"}, "start": {"backward": true, "node_id": "264513"}}, "start": {"backward": true, "node_id": "264509"}, "start_end_reachable": true, "type": 1}
error[RepresentativeTraversalFinder]: This can happen when the path you are calling against traverses the same part of your graph twice.
terminate called recursively
terminate called recursively
terminate called after throwing an instance of 'terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
std::runtime_error'
error[RepresentativeTraversalFinder]: Node 204888 is on backbone path at 37 but not traced in site {"directed_acyclic_net_graph": true, "end": {"backward": true, "node_id": "204887"}, "parent": {"end": {"node_id": "204892"}, "start": {"node_id": "204886"}}, "start": {"backward": true, "node_id": "204889"}, "start_end_reachable": true, "type": 1}
error[RepresentativeTraversalFinder]: This can happen when the path you are calling against traverses the same part of your graph twice.
Segmentation fault (core dumped)

The error message is very specific and repetitive. It says that some paths are traversing parts of my graph twice, which I could already see in IVG. It also mentions that this is relevant in the path I am calling against. Since I didn’t specify one, the expected default is that all five paths were used.
Let’s check some of the nodes listed here for a better picture, since the node IDs are not ones I associate with the loopy region I know.

The first node mentioned in the error message (428266) is at the right side of the region displayed by IVG, since the node IDs are counting backwards here. It’s the lower “T” at the position with a different “T” or a “G”, which is touched by all paths except NC_009656.1 (PA7).

Node 428266 Node 428266 (the lower “T” at the position where also a different “T” or a “G” are possible)

Next up is node 377277, the second node in the picture (the upper “T” this time). This one is also touched by all except PA7.

Node 377277 Node 377277 (the “T” at the second position where a “G” is also possible)

Node 406498 is in a reverse region again. This time it’s the “G” and not the “T”, and it is touched by every path.

Node 406498 Node 406498 (the “G” at the second to last position where a “T” or another “G” is also possible)

This is the first time that I can actually see a path going through the node twice, but I can’t tell if this is a visualisation error or visualising the true problem.

Looking through the other nodes, they all seem pretty much the same: 336408, 273287, 103025, and 122241 are all touched by all paths except PA7, all without visible path duplications, all with two alternative alleles. Node 264508 is different as it is the only one that is touched by all except PA14, and node 204888 is touched by all paths, even twice by PA14 and PAO1:

Node 204888 Node 204888 (the second “G”, where a “C” is also possible)

Overall, these node all appear in regions with multiple loops and are mostly not part of the most confusing structures. Why those nodes? And why are they only listed once when they are touched by multiple paths so this error could happen multiple times per node? Let’s see what happens when I select PA7 as path to call variants against, as the reference strain which touched the least nodes of those problematic ones.

vg call /data3/genome_graphs/CPANG/playground/day3/references/FivePsae_mod_new.xg -k CH3797_R1.vs.FivePsae.pack -s CH3797 -p "refseq|NC_009656.1|chromosome" > CH3797_R1.vs.PA7.vcf
error[RepresentativeTraversalFinder]: Node 765479 is on backbone path at 10 but not traced in site {"directed_acyclic_net_graph": true, "end": {"node_id": "136547"}, "parent": {"end": {"node_id": "136554"}, "start": {"node_id": "136544"}}, "start": {"node_id": "136545"}, "start_end_reachable": true, "type": 1}
error[RepresentativeTraversalFinder]: This can happen when the path you are calling against traverses the same part of your graph twice.
error[RepresentativeTraversalFinder]: Node 769621 is on backbone path at 6 but not traced in site {"directed_acyclic_net_graph": true, "end": {"node_id": "146325"}, "parent": {"end": {"backward": true, "node_id": "146322"}, "start": {"backward": true, "node_id": "146326"}}, "start": {"node_id": "146323"}, "start_end_reachable": true, "type": 1}
error[RepresentativeTraversalFinder]: This can happen when the path you are calling against traverses the same part of your graph twice.
error[RepresentativeTraversalFinder]: Node 786638 is on backbone path at 12 but not traced in site {"directed_acyclic_net_graph": true, "end": {"node_id": "185940"}, "parent": {"end": {"node_id": "185941"}, "start": {"node_id": "185936"}}, "start": {"node_id": "185938"}, "start_end_reachable": true, "type": 1}
error[RepresentativeTraversalFinder]: This can happen when the path you are calling against traverses the same part of your graph twice.
terminate called after throwing an instance of 'std::runtime_error'
terminate called recursively
  what():  Extra ref node found
terminate called recursively
ERROR: Signal 6 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_A5ytI3/stacktrace.txt
Please include the stack trace file in your bug report!

Ok, this is basically the same problem, but it ends differently. And wait until you see the stacktrace file:

Crash report for vg v1.22.0 "Rotella"
Stack trace (most recent call last) in thread 31202:
#13   Object "", at 0xffffffffffffffff, in
#12   Object "", at 0x7f3e1ccc441c, in
#11   Object "", at 0x7f3e1f52c6b9, in
#10   Object "", at 0x7f3e1d1ac43d, in
#9    Object "", at 0xddb6d2, in
#8    Object "", at 0xde349c, in
#7    Object "", at 0x7f3e1cf97486, in
#6    Object "", at 0x7f3e1cf96f82, in
#5    Object "", at 0x7f3e1d44c004, in
#4    Object "", at 0x7f3e1d44b6a8, in
#3    Object "", at 0x7f3e1d44c6b5, in
#2    Object "", at 0x7f3e1d44e84c, in
#1    Object "", at 0x7f3e1cbf4029, in
#0    Object "", at 0x7f3e1cbf2428, in

What is that supposed to tell anyone?

Whatever it is, there are also three new nodes to look at: 765479, 769621, and 786638, this time using PA7 as reference.

Node 765479 Node 765479 Node 769621 Node 769621 Node 786638 Node 786638

I tried to mark the nodes of interest with “yellow marker”, but I think node 786638 is still difficult to see with all the other stuff going on right beside it - it’s the smaller one of the two second to last nodes, only touched by PA7.

No matter how many nodes I look at, I can’t figure out why only some of them cause the problems and others don’t. Maybe IVG isn’t the right tool for this kind of question.

vg find -n 786638 -x /data3/genome_graphs/CPANG/playground/day3/references/FivePsae_mod_new.xg | vg view -j -
{"node": [{"id": "786638", "sequence": "A"}], "path": [{"mapping": [{"edit": [{"from_length": 1, "to_length": 1}], "position": {"node_id": "786638"}, "rank": "1"}], "name": "refseq|NC_009656.1|chromosome"}]}

That looks pretty normal to me…

vg find -n 786638 -x CH3797_R1.vs.FivePsae.vg | vg view -j -

Using a graph with vg find takes a lot longer than using the xg index (or maybe that’s because it’s the augmented version).

{"node": [{"id": "786638", "sequence": "A"}], "path": [{"mapping": [{"edit": [{"from_length": 1, "to_length": 1}], "position": {"node_id": "786638"}, "rank": "1"}], "name": "refseq|NC_009656.1|chromosome"}]}

Still, the result is the same.

Open Questions

  • What is going on with vg surject when a path position could not be identified?
  • Where to get the best workflows/pipelines to use? Are there any good, up-to-date tutorials?
  • How is the mapping influenced by the number of nodes and edges?
  • How can I find unmapped reads?
  • How can I get the variant calling to work?
comments powered by Disqus