Skip to content

Help understanding motif search with --known-motif #553

@ndusek

Description

@ndusek

I am having some trouble understanding how the motif search algorithm works with the --known-motif option in modkit version 0.6.0.

First, we ran modkit motif search against a bedMethyl file, without pre-specifying any target motifs.

Command

modkit motif search -i ${bedmethyl} -r ${ref}

Output:

+-------------------+------------+------------+-----------+-----------+
| motif             | frac_mod   | high_count | low_count | mid_count |
+-------------------+------------+------------+-----------+-----------+
| G[a]NTC           | 0.9956344  | 14368      | 63        | 2719      |
| GN[m]CTGGNC       | 0.931127   | 1041       | 77        | 408       |
| GN[m]CTGGY        | 0.91173244 | 1663       | 161       | 706       |
| GA[a]TCNG         | 0.90301    | 810        | 87        | 897       |
| A[m]CTGGT         | 0.9028133  | 353        | 38        | 125       |
| GGCW[a]TNNNNNG    | 0.90179265 | 1157       | 126       | 551       |
| GGCD[a]TNC        | 0.89689034 | 2740       | 315       | 1521      |
| KGA[a]TC          | 0.8873343  | 1205       | 153       | 1379      |
| GGCW[a]TY         | 0.8872549  | 1991       | 253       | 1234      |
| GA[a]TCNNR        | 0.88256484 | 1225       | 163       | 1411      |
| ANNNNGA[a]TC      | 0.8817204  | 492        | 66        | 510       |
| GGNNNNGNCW[a]T    | 0.8772093  | 943        | 132       | 513       |
| SNNNGGNCA[a]T     | 0.8767651  | 683        | 96        | 479       |
| GA[a]TCNNNNC      | 0.86887115 | 762        | 115       | 921       |
| GA[a]TCNNNNNG     | 0.8684531  | 713        | 108       | 865       |
| RNNNGA[a]TC       | 0.8659649  | 1234       | 191       | 1517      |
| GA[a]TCNNNNNNNNG  | 0.864899   | 685        | 107       | 879       |
| SNNNNNNNNNGA[a]TC | 0.8643552  | 1421       | 223       | 1793      |
| GGGCN[a]T         | 0.8640933  | 2149       | 338       | 1173      |
| GA[a]TCNNNNNNK    | 0.86117136 | 1191       | 192       | 1499      |
| GA[a]TCNNNNNNNNNR | 0.85863096 | 1154       | 190       | 1507      |
| SNNNNNGA[a]TC     | 0.8549093  | 1367       | 232       | 1761      |
+-------------------+------------+------------+-----------+-----------+

From this output, we identified (visually and based on literature for this species) a smaller number of shorter motifs, each being a subset of one or more of the longer motifs. For example: GA[a]TC. We then went back and reran the motif search using the --known-motif option, expecting that modkit would just search directly for GA[a]TC, regardless of the surrounding genome context, and report the number observations for just that motif. However, that seems not to be the case.

Instead, it seems like modkit is still performing a naive search first to find all motifs, and then comparing the resulting list against the --known-motif. Which is fine I suppose, except that several motifs that seem to me to be proper subsets of the collection of sequences containing GA[a]TC are being called "Disjoint". For example:

Command:

modkit motif search --known-motif GAATC 2 a -i ${bedmethyl} -r ${ref}

Output (subset of full output):

+-------------------+------------+------------+-----------+-----------+----------+---------------------+
| motif             | frac_mod   | high_count | low_count | mid_count | status   | closest_known_motif |
+-------------------+------------+------------+-----------+-----------+----------+---------------------+
| GA[a]TCNG         | 0.90301    | 810        | 87        | 897       | Disjoint | GA[a]TC             |
| GA[a]TCNNR        | 0.88256484 | 1225       | 163       | 1411      | Disjoint | GA[a]TC             |
| ANNNNGA[a]TC      | 0.8817204  | 492        | 66        | 510       | Disjoint | GA[a]TC             |
| GA[a]TCNNNNC      | 0.86887115 | 762        | 115       | 921       | Disjoint | GA[a]TC             |
| GA[a]TCNNNNNG     | 0.8684531  | 713        | 108       | 865       | Disjoint | GA[a]TC             |
| RNNNGA[a]TC       | 0.8659649  | 1234       | 191       | 1517      | Disjoint | GA[a]TC             |
| GA[a]TCNNNNNNNNG  | 0.864899   | 685        | 107       | 879       | Disjoint | GA[a]TC             |
| SNNNNNNNNNGA[a]TC | 0.8643552  | 1421       | 223       | 1793      | Disjoint | GA[a]TC             |
| GA[a]TCNNNNNNK    | 0.86117136 | 1191       | 192       | 1499      | Disjoint | GA[a]TC             |
| GA[a]TCNNNNNNNNNR | 0.85863096 | 1154       | 190       | 1507      | Disjoint | GA[a]TC             |
| SNNNNNGA[a]TC     | 0.8549093  | 1367       | 232       | 1761      | Disjoint | GA[a]TC             |
+-------------------+------------+------------+-----------+-----------+----------+---------------------+

As you can see, each entry has the target motif listed as the closest_known_motif, and in each case, the target motif GA[a]TC is a substring of the full motif. But modkit is classifying each motif as "disjoint" with the target motif. Whereas, it seems to me that either

  1. GA[a]TC should be a subset of each motif in column 1, if the term "subset" is being used with respect to the string, i.e. meaning a "substring"; or
  2. GA[a]TC should be a superset of each motif in column 1, in the sense that the collection of sequences containing the full motif is a proper subset of the collection of sequences that contain GA[a]TC.

In any case, I cannot think of what other definition of subset/superset would result in calling these motifs "disjoint" with GA[a]TC.

Given all the above, the questions I have are:

  • Can you explain how modkit designates a motif as either a subset, superset, or disjoint?
  • Is there an option or combination of options that will make modkit do a direct search, rather than a naive search and post hoc comparison? For example, this doc mentions a --skip-search option. Will that do what I want?
  • Is there a better/different way to use modkit to accomplish what I'm trying to do?

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingquestionLooking for clarification on inputs and/or outputs

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions