Skip to content

Conversation

@MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Dec 26, 2025

Add a way to expose metrics from the Docker image (SYNAPSE_ENABLE_METRICS).

Spawning from wanting to run a load test against the Complement Docker image of Synapse and see metrics from the homeserver.

Why not just provide your own homeserver config?

Probably possible but it gets tricky when you try to use the workers variant of the Docker image (docker/Dockerfile-workers). The way to workaround it would probably be to yq edit everything in a script and change /data/homeserver.yaml and /conf/workers/*.yaml to add the metrics listener. And then modify /conf/workers/shared.yaml to add enable_metrics: true. Doesn't spark much joy.

Testing strategy

  1. Make sure your firewall allows the Docker containers to communicate to the host (host.docker.internal) so they can access exposed ports of other Docker containers. We want to allow Synapse to access the Prometheus container and Grafana to access to the Prometheus container.
    • sudo ufw allow in on docker0 comment "Allow traffic from the default Docker network to the host machine (host.docker.internal)"
    • sudo ufw allow in on br-+ comment "(from Matrix Complement testing) Allow traffic from custom Docker networks to the host machine (host.docker.internal)"
    • Complement firewall docs
  2. Build the Docker image for Synapse: docker build -t matrixdotorg/synapse -f docker/Dockerfile . (docs)
  3. Generate config for Synapse:
    docker run -it --rm \
        --mount type=volume,src=synapse-data,dst=/data \
        -e SYNAPSE_SERVER_NAME=hs1 \
        -e SYNAPSE_REPORT_STATS=yes \
        -e SYNAPSE_ENABLE_METRICS=1 \
        matrixdotorg/synapse:latest generate
    
  4. Start Synapse:
    docker run -d --name synapse \
       --mount type=volume,src=synapse-data,dst=/data \
       -p 8008:8008 \
       -p 19090:19090 \
       matrixdotorg/synapse:latest
    
  5. You should be able to see metrics from Synapse at http://localhost:19090/_synapse/metrics
  6. Create a Prometheus config (prometheus.yml)
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
      evaluation_interval: 15s
    
    scrape_configs:
      - job_name: prometheus
        scrape_interval: 15s
        metrics_path: /_synapse/metrics
        scheme: http
        static_configs:
          - targets:
              # This should point to the Synapse metrics listener (we're using `host.docker.internal` because this is from within the Prometheus container)
              - host.docker.internal:19090
  7. Start Prometheus (update the volume bind mount to the config you just saved somewhere):
    docker run \
        --detach \
        --name=prometheus \
        -p 9090:9090 \
        -v ~/Documents/code/random/prometheus-config/prometheus.yml:/etc/prometheus/prometheus.yml \
        prom/prometheus
    
  8. Make sure you're seeing some data in Prometheus. On http://localhost:9090/query, search for synapse_build_info
  9. Start Grafana
    docker run -d --name=grafana --add-host host.docker.internal:host-gateway -p 3000:3000 grafana/grafana
    
  10. Visit the Grafana dashboard, http://localhost:3000/ (Credentials: admin/admin)
  11. Connections -> Data Sources -> Add data source -> Prometheus
    • Prometheus server URL: http://host.docker.internal:9090
  12. Import the Synapse dashboard: https://github.com/element-hq/synapse/blob/develop/contrib/grafana/synapse.json

Dev notes

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

COPY ./docker/start.py /start.py
COPY ./docker/conf /conf

EXPOSE 8008/tcp 8009/tcp 8448/tcp
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8009 was removed because it's ACME support was removed in matrix-org/synapse#10194

Comment on lines +849 to +856
# Keep the `shared_config` up to date with the `shared_extra_conf` from each
# worker.
shared_config = {
**worker_config["shared_extra_conf"],
# We combine `shared_config` second to avoid overwriting existing keys
# because TODO: why?
**shared_config,
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split out to #19323

Comment on lines +195 to +197
# SYNAPSE_ENABLE_METRICS=1). Metrics for workers are on ports starting from 19091 but
# since these are dynamic we don't expose them by default.
EXPOSE 19090/tcp
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a future PR, I think it would be useful to add Prometheus service discovery endpoint to make it easy to discover all of the workers and random ports here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exploring this in #19336

# regardless of the SYNAPSE_LOG_LEVEL setting.
# * SYNAPSE_LOG_TESTING: if set, Synapse will log additional information useful
# for testing.
# * SYNAPSE_USE_UNIX_SOCKET: TODO
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something for a future PR to address ⏩

Comment on lines +56 to +63
{% if SYNAPSE_ENABLE_METRICS %}
- type: metrics
# The main process always uses the same port 19090
#
# Prometheus does not support Unix sockets so we don't bother with
# `SYNAPSE_USE_UNIX_SOCKET`, https://github.com/prometheus/prometheus/issues/12024
port: 19090
{% endif %}
Copy link
Contributor Author

@MadLittleMods MadLittleMods Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole config situation for our Docker image is pretty confusing. This docker/conf/homeserver.yaml is only used for the migrate_config mode,

synapse/docker/start.py

Lines 218 to 230 in 7a24faf

# In generate mode, generate a configuration and missing keys, then exit
if mode == "generate":
return run_generate_config(environ, ownership)
if mode == "migrate_config":
# generate a config based on environment vars.
config_dir = environ.get("SYNAPSE_CONFIG_DIR", "/data")
config_path = environ.get(
"SYNAPSE_CONFIG_PATH", config_dir + "/homeserver.yaml"
)
return generate_config_from_template(
config_dir, config_path, environ, ownership
)

For the generate mode, we also have a bunch of changes to support SYNAPSE_ENABLE_METRICS (see synapse/config)

Comment on lines +34 to +50
def strtobool(val: str) -> bool:
"""Convert a string representation of truth to True or False
True values are 'y', 'yes', 't', 'true', 'on', and '1'; false values
are 'n', 'no', 'f', 'false', 'off', and '0'. Raises ValueError if
'val' is anything else.
This is lifted from distutils.util.strtobool, with the exception that it actually
returns a bool, rather than an int.
"""
val = val.lower()
if val in ("y", "yes", "t", "true", "on", "1"):
return True
elif val in ("n", "no", "f", "false", "off", "0"):
return False
else:
raise ValueError("invalid truth value %r" % (val,))
Copy link
Contributor Author

@MadLittleMods MadLittleMods Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is copied from

def strtobool(val: str) -> bool:
"""Convert a string representation of truth to True or False
True values are 'y', 'yes', 't', 'true', 'on', and '1'; false values
are 'n', 'no', 'f', 'false', 'off', and '0'. Raises ValueError if
'val' is anything else.
This is lifted from distutils.util.strtobool, with the exception that it actually
returns a bool, rather than an int.
"""
val = val.lower()
if val in ("y", "yes", "t", "true", "on", "1"):
return True
elif val in ("n", "no", "f", "false", "off", "0"):
return False
else:
raise ValueError("invalid truth value %r" % (val,))

In docker/start.py, it doesn't seem like we have any dependencies on Synapse code so I just lifted it over here.

Comment on lines 26 to 38
listeners:
- port: 8008
- bind_addresses:
- ::1
- 127.0.0.1
port: 8008
resources:
- compress: false
names:
- client
- federation
tls: false
type: http
x_forwarded: true
Copy link
Contributor Author

@MadLittleMods MadLittleMods Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are generated from poetry run scripts-dev/generate_sample_config.sh

This has changed because we're now passing in a default set of listeners instead of raw string manipulation. See synapse/config/_base.py and synapse/config/server.py

@MadLittleMods MadLittleMods marked this pull request as ready for review December 31, 2025 00:17
@MadLittleMods MadLittleMods requested a review from a team as a code owner December 31, 2025 00:17
Copy link
Member

@anoadragon453 anoadragon453 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the way we generate worker configs for testing is pretty convoluted. I'm also conscious that ESS is trying to do the same thing for production services, albeit with proprietary components.

Either way, thanks for this. Very useful option!

# regardless of the SYNAPSE_LOG_LEVEL setting.
# * SYNAPSE_LOG_TESTING: if set, Synapse will log additional information useful
# for testing.
# * SYNAPSE_USE_UNIX_SOCKET: TODO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# * SYNAPSE_USE_UNIX_SOCKET: TODO
# * SYNAPSE_USE_UNIX_SOCKET: TODO (https://github.com/prometheus/prometheus/issues/12024)

Let's add the blocker here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a general doc comment for SYNAPSE_USE_UNIX_SOCKET.

prometheus/prometheus#12024 isn't relevant to link there.

@MadLittleMods MadLittleMods merged commit 9dae6cc into develop Jan 1, 2026
112 of 117 checks passed
@MadLittleMods MadLittleMods deleted the madlittlemods/docker-metrics branch January 1, 2026 20:00
@MadLittleMods
Copy link
Contributor Author

Thanks for the review @anoadragon453 🦋

MadLittleMods added a commit that referenced this pull request Jan 2, 2026
…configuration scripts (#19323)

For reference, this PR used to include this whole `shared_config` block  in the diff.

But #19324 was merged first which introduced parts of it already.

Here is what this code used to look like: https://github.com/element-hq/synapse/blob/566670c363915691826b5b435c4aa7acde61b408/docker/configure_workers_and_start.py#L865-L868

---

Original context for why it was changed this way:
matrix-org/synapse#14921 (comment)


Previously, this code made me question two things:

1. Do we actually use `worker_config["shared_extra_conf"]` in the
templates?
- At first glance, I couldn't see why we're updating `shared_extra_conf`
here. It's not used in the `worker.yaml.j2` template so all of this
seemed a bit pointless.
- Turns out, updating `shared_extra_conf` itself is pointless and it's
being done as a convenient place to mix the objects to get things right
in `shared_config` (confusing).
 1. Does it actually do anything?
- Because `shared_config` starts out as an empty object, my first glance
made me think we we're just updating with an empty object and then just
re-assigning. But because we're in a loop, we actually accumulate the
`shared_extra_conf` from each worker.

I'm not sure whether I'm capturing my confusion well enough here but
basically, this made me spend time trying to figure out what/why we're
doing things this way and we can use a more clear pattern to accomplish
the same thing.

---

This change is spawning from looking at the
`docker/configure_workers_and_start.py` script in order to add a metrics
listener ([upcoming
PR](#19324)).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants