-
Notifications
You must be signed in to change notification settings - Fork 20
Description
I am having some trouble understanding how the motif search algorithm works with the --known-motif option in modkit version 0.6.0.
First, we ran modkit motif search against a bedMethyl file, without pre-specifying any target motifs.
Command
modkit motif search -i ${bedmethyl} -r ${ref}Output:
+-------------------+------------+------------+-----------+-----------+
| motif | frac_mod | high_count | low_count | mid_count |
+-------------------+------------+------------+-----------+-----------+
| G[a]NTC | 0.9956344 | 14368 | 63 | 2719 |
| GN[m]CTGGNC | 0.931127 | 1041 | 77 | 408 |
| GN[m]CTGGY | 0.91173244 | 1663 | 161 | 706 |
| GA[a]TCNG | 0.90301 | 810 | 87 | 897 |
| A[m]CTGGT | 0.9028133 | 353 | 38 | 125 |
| GGCW[a]TNNNNNG | 0.90179265 | 1157 | 126 | 551 |
| GGCD[a]TNC | 0.89689034 | 2740 | 315 | 1521 |
| KGA[a]TC | 0.8873343 | 1205 | 153 | 1379 |
| GGCW[a]TY | 0.8872549 | 1991 | 253 | 1234 |
| GA[a]TCNNR | 0.88256484 | 1225 | 163 | 1411 |
| ANNNNGA[a]TC | 0.8817204 | 492 | 66 | 510 |
| GGNNNNGNCW[a]T | 0.8772093 | 943 | 132 | 513 |
| SNNNGGNCA[a]T | 0.8767651 | 683 | 96 | 479 |
| GA[a]TCNNNNC | 0.86887115 | 762 | 115 | 921 |
| GA[a]TCNNNNNG | 0.8684531 | 713 | 108 | 865 |
| RNNNGA[a]TC | 0.8659649 | 1234 | 191 | 1517 |
| GA[a]TCNNNNNNNNG | 0.864899 | 685 | 107 | 879 |
| SNNNNNNNNNGA[a]TC | 0.8643552 | 1421 | 223 | 1793 |
| GGGCN[a]T | 0.8640933 | 2149 | 338 | 1173 |
| GA[a]TCNNNNNNK | 0.86117136 | 1191 | 192 | 1499 |
| GA[a]TCNNNNNNNNNR | 0.85863096 | 1154 | 190 | 1507 |
| SNNNNNGA[a]TC | 0.8549093 | 1367 | 232 | 1761 |
+-------------------+------------+------------+-----------+-----------+
From this output, we identified (visually and based on literature for this species) a smaller number of shorter motifs, each being a subset of one or more of the longer motifs. For example: GA[a]TC. We then went back and reran the motif search using the --known-motif option, expecting that modkit would just search directly for GA[a]TC, regardless of the surrounding genome context, and report the number observations for just that motif. However, that seems not to be the case.
Instead, it seems like modkit is still performing a naive search first to find all motifs, and then comparing the resulting list against the --known-motif. Which is fine I suppose, except that several motifs that seem to me to be proper subsets of the collection of sequences containing GA[a]TC are being called "Disjoint". For example:
Command:
modkit motif search --known-motif GAATC 2 a -i ${bedmethyl} -r ${ref}Output (subset of full output):
+-------------------+------------+------------+-----------+-----------+----------+---------------------+
| motif | frac_mod | high_count | low_count | mid_count | status | closest_known_motif |
+-------------------+------------+------------+-----------+-----------+----------+---------------------+
| GA[a]TCNG | 0.90301 | 810 | 87 | 897 | Disjoint | GA[a]TC |
| GA[a]TCNNR | 0.88256484 | 1225 | 163 | 1411 | Disjoint | GA[a]TC |
| ANNNNGA[a]TC | 0.8817204 | 492 | 66 | 510 | Disjoint | GA[a]TC |
| GA[a]TCNNNNC | 0.86887115 | 762 | 115 | 921 | Disjoint | GA[a]TC |
| GA[a]TCNNNNNG | 0.8684531 | 713 | 108 | 865 | Disjoint | GA[a]TC |
| RNNNGA[a]TC | 0.8659649 | 1234 | 191 | 1517 | Disjoint | GA[a]TC |
| GA[a]TCNNNNNNNNG | 0.864899 | 685 | 107 | 879 | Disjoint | GA[a]TC |
| SNNNNNNNNNGA[a]TC | 0.8643552 | 1421 | 223 | 1793 | Disjoint | GA[a]TC |
| GA[a]TCNNNNNNK | 0.86117136 | 1191 | 192 | 1499 | Disjoint | GA[a]TC |
| GA[a]TCNNNNNNNNNR | 0.85863096 | 1154 | 190 | 1507 | Disjoint | GA[a]TC |
| SNNNNNGA[a]TC | 0.8549093 | 1367 | 232 | 1761 | Disjoint | GA[a]TC |
+-------------------+------------+------------+-----------+-----------+----------+---------------------+
As you can see, each entry has the target motif listed as the closest_known_motif, and in each case, the target motif GA[a]TC is a substring of the full motif. But modkit is classifying each motif as "disjoint" with the target motif. Whereas, it seems to me that either
GA[a]TCshould be a subset of each motif in column 1, if the term "subset" is being used with respect to the string, i.e. meaning a "substring"; orGA[a]TCshould be a superset of each motif in column 1, in the sense that the collection of sequences containing the full motif is a proper subset of the collection of sequences that containGA[a]TC.
In any case, I cannot think of what other definition of subset/superset would result in calling these motifs "disjoint" with GA[a]TC.
Given all the above, the questions I have are:
- Can you explain how
modkitdesignates a motif as either a subset, superset, or disjoint? - Is there an option or combination of options that will make
modkitdo a direct search, rather than a naive search and post hoc comparison? For example, this doc mentions a--skip-searchoption. Will that do what I want? - Is there a better/different way to use
modkitto accomplish what I'm trying to do?