์ฑ ์์๋ Airflow, NiFi, PostgreSQL, Elasticsearch, Kibana, Kafka, Spark ๋ฑ ๋ชจ์กฐ๋ฆฌ ๋ค ๋ก์ปฌ ํ๊ฒฝ์์ ์ค์นํด์ ์ค์ตํ๋ค. ํ์ง๋ง ์ด๋ ๋ด๊ฐ ์์ฃผ ์ซ์ดํ๋ ์ํฉ์ด๋ฏ๋ก ๋น์ฐํ๊ฒ Docker๋ฅผ ํ์ฉํด์ ํ๊ฒฝ์ ๊ตฌ์ถํ๋ค.
| Software / hardware | OS requirements |
|---|---|
| Python | 3.12.8 |
| Nifi | apache/nifi:1.28.0 |
| PostgreSQL | postgres:13 |
| ElasticSearch | elasticsearch:7.17.28 |
| Kibana | kibana:7.17.28 |
app-py3.12 โ {seilylook} ๐ make build
==============================================
Exporting Python dependencies to requirements.txt...
==============================================
poetry export -f requirements.txt --output requirements.txt --without-hashes --with dev
Warning: poetry-plugin-export will not be installed by default in a future version of Poetry.
In order to avoid a breaking change and make your automation forward-compatible, please install poetry-plugin-export explicitly. See https://python-poetry.org/docs/plugins/#using-plugins for details on how to install a plugin.
To disable this warning run 'poetry config warnings.export false'.
==============================================
Building Docker image python-app:latest...
==============================================
docker build -t python-app:latest .
[+] Building 1.7s (20/20) FINISHED docker:desktop-linux
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 1.14kB 0.0s
=> [internal] load metadata for docker.io/library/python:3.12-slim 1.5s
=> [auth] library/python:pull token for registry-1.docker.io 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 33.38kB 0.0s
=> [builder 1/5] FROM docker.io/library/python:3.12-slim@sha256:aaa3f8cb64dd64e5f8cb6e58346bdcfa410a108324b0f28f1a7cc5964355b211 0.0s
=> CACHED [stage-1 2/10] RUN apt-get update && apt-get install -y --no-install-recommends default-jdk procps wget libpq5 0.0s
=> CACHED [stage-1 3/10] WORKDIR /app 0.0s
=> CACHED [builder 2/5] WORKDIR /app 0.0s
=> CACHED [builder 3/5] RUN apt-get update && apt-get install -y --no-install-recommends build-essential libpq-dev python3-dev 0.0s
=> CACHED [builder 4/5] COPY requirements.txt . 0.0s
=> CACHED [builder 5/5] RUN pip wheel --no-cache-dir --no-deps --wheel-dir /app/wheels -r requirements.txt 0.0s
=> CACHED [stage-1 4/10] COPY --from=builder /app/wheels /wheels 0.0s
=> CACHED [stage-1 5/10] COPY --from=builder /app/requirements.txt . 0.0s
=> CACHED [stage-1 6/10] RUN pip install --no-cache /wheels/* 0.0s
=> [stage-1 7/10] COPY src/ src/ 0.0s
=> [stage-1 8/10] COPY tests/ tests/ 0.0s
=> [stage-1 9/10] COPY data/ data/ 0.0s
=> [stage-1 10/10] COPY conf/ conf/ 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:a7003da4b1bb9a04c9d98b7a386e9c28c84efe83df2ae20e99abc16cedf9e3fb 0.0s
=> => naming to docker.io/library/python-app:latest 0.0s
View build details: docker-desktop://dashboard/build/desktop-linux/desktop-linux/4go6dhhdb6ahlu1m1xmndgsou
What's next:
View a summary of image vulnerabilities and recommendations โ docker scout quickview
==============================================
Constructing Docker Containers...
==============================================
docker compose up -d
WARN[0000] The "AIRFLOW_UID" variable is not set. Defaulting to a blank string.
WARN[0000] The "AIRFLOW_UID" variable is not set. Defaulting to a blank string.
WARN[0000] /Users/seilylook/Development/Book/Data_Engineering_with_Python/docker-compose.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
[+] Running 10/10
โ Container postgres Healthy 4.3s
โ Container elasticsearch Healthy 3.6s
โ Container redis Healthy 4.3s
โ Container kibana Running 0.0s
โ Container airflow-init Exited 7.8s
โ Container python-app Started 3.8s
โ Container airflow-triggerer Running 0.0s
โ Container airflow-webserver Running 0.0s
โ Container airflow-scheduler Running 0.0s
โ Container airflow-worker Running 0.0s
==============================================
Waiting for PostgreSQL to start...
==============================================
=====================================
Initializing PostgreSQL...
=====================================
chmod +x ./scripts/init_postgresql.sh
./scripts/init_postgresql.sh
dataengineering ๋ฐ์ดํฐ๋ฒ ์ด์ค ์์ฑ ์ค...
ERROR: database "dataengineering" already exists
Successfully copied 2.05kB to postgres:/tmp/create_tables.sql
ํ
์ด๋ธ ์์ฑ ์ค...
psql:/tmp/create_tables.sql:10: NOTICE: relation "users" already exists, skipping
CREATE TABLE
List of relations
Schema | Name | Type | Owner
--------+-------+-------+---------
public | users | table | airflow
(1 row)
Table "public.users"
Column | Type | Collation | Nullable | Default
--------+------------------------+-----------+----------+-----------------------------------
id | integer | | not null | nextval('users_id_seq'::regclass)
name | character varying(100) | | not null |
street | character varying(200) | | |
city | character varying(100) | | |
zip | character varying(10) | | |
lng | numeric(10,6) | | |
lat | numeric(10,6) | | |
Indexes:
"users_pkey" PRIMARY KEY, btree (id)
๊ถํ ๋ถ์ฌ ์ค...
GRANT
๋ฐ์ดํฐ๋ฒ ์ด์ค์ ํ
์ด๋ธ์ด ์ฑ๊ณต์ ์ผ๋ก ์์ฑ๋์์ต๋๋ค.
๋ฐ์ดํฐ๋ฒ ์ด์ค ์ฐ๊ฒฐ ํ
์คํธ ์ค...
/var/run/postgresql:5432 - accepting connections
๋ฐ์ดํฐ๋ฒ ์ด์ค๊ฐ ์ ์์ ์ผ๋ก ์๋ตํฉ๋๋ค.
==============================================
Waiting for Elasticsearch to start...
==============================================
=====================================
Initializing Elasticsearch...
=====================================
chmod +x ./scripts/init_elasticsearch.sh
./scripts/init_elasticsearch.sh
Elasticsearch๊ฐ ์ค๋น๋ ๋๊น์ง ๋๊ธฐ ์ค...
Elasticsearch๊ฐ ์ค๋น๋์์ต๋๋ค.
users ์ธ๋ฑ์ค ์์ฑ ์ค...
{"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [users/AW57UPmxTYyW3G-GdL6lHw] already exists","index_uuid":"AW57UPmxTYyW3G-GdL6lHw","index":"users"}],"type":"resource_already_exists_exception","reason":"index [users/AW57UPmxTYyW3G-GdL6lHw] already exists","index_uuid":"AW57UPmxTYyW3G-GdL6lHw","index":"users"},"status":400}users ์ธ๋ฑ์ค๊ฐ ์ฑ๊ณต์ ์ผ๋ก ์์ฑ๋์์ต๋๋ค.
์ธ๋ฑ์ค ๋ชฉ๋ก ํ์ธ:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green open .internal.alerts-transform.health.alerts-default-000001 MSc-VAFyQHG9tGk2TxwbGg 1 0 0 0 249b 249b 249b
green open .internal.alerts-observability.logs.alerts-default-000001 Bief08evQ_SX_3UrwZOzwQ 1 0 0 0 249b 249b 249b
green open .internal.alerts-observability.uptime.alerts-default-000001 N-ptFAoNTYeF7OHGtOqZFw 1 0 0 0 249b 249b 249b
green open .internal.alerts-ml.anomaly-detection.alerts-default-000001 JL9JINenTS-XMJClxKXrYA 1 0 0 0 249b 249b 249b
green open .internal.alerts-observability.slo.alerts-default-000001 rFwL_h62QZmQx8kkXFZIyA 1 0 0 0 249b 249b 249b
green open .internal.alerts-default.alerts-default-000001 6xFM1NqvTc--9BJ7MlhCpA 1 0 0 0 249b 249b 249b
green open .internal.alerts-observability.apm.alerts-default-000001 D1xM5G2DTPeF25kkR6_KWQ 1 0 0 0 249b 249b 249b
green open users AW57UPmxTYyW3G-GdL6lHw 1 0 1000 0 229.1kb 229.1kb 229.1kb
green open .internal.alerts-observability.metrics.alerts-default-000001 IO4Li-8sS6-AH8aSsvEqLg 1 0 0 0 249b 249b 249b
green open .internal.alerts-ml.anomaly-detection-health.alerts-default-000001 95sA140IQ2qCyErtmHKTEg 1 0 0 0 249b 249b 249b
green open .internal.alerts-observability.threshold.alerts-default-000001 TV3IWC9fQUuYWQO-56_5sA 1 0 0 0 249b 249b 249b
green open .internal.alerts-security.alerts-default-000001 VE-qcLiURZy7h_i1hb96Lw 1 0 0 0 249b 249b 249b
green open .internal.alerts-stack.alerts-default-000001 _PEhTLb4TJ2bYAwiv6kz8w 1 0 0 0 249b 249b 249bapp-py3.12 {seilylook} ๐ make test
=======================
Running tests with pytest...
=======================
mkdir -p target
docker run --rm -v /Users/seilylook/Development/Book/Data_Engineering_with_Python/app/target:/app/target python-app:latest /bin/bash -c \
'for test_file in $(find tests -name "*.py" ! -name "__init__.py"); do \
base_name=$(basename $test_file .py); \
pytest $test_file --junitxml=target/$base_name.xml; \
done'
============================= test session starts ==============================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /app
plugins: Faker-36.1.1, time-machine-2.16.0, anyio-4.8.0
collected 7 items
tests/test_progress_bar.py ....... [100%]
------------ generated xml file: /app/target/test_progress_bar.xml -------------
============================== 7 passed in 0.06s ===============================
============================= test session starts ==============================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /app
plugins: Faker-36.1.1, time-machine-2.16.0, anyio-4.8.0
collected 9 items
tests/test_data_generator.py ......... [100%]
----------- generated xml file: /app/target/test_data_generator.xml ------------
============================== 9 passed in 0.03s ===============================
============================= test session starts ==============================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /app
plugins: Faker-36.1.1, time-machine-2.16.0, anyio-4.8.0
collected 0 items
----------------- generated xml file: /app/target/conftest.xml -----------------
============================ no tests ran in 0.00s =============================
============================= test session starts ==============================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /app
plugins: Faker-36.1.1, time-machine-2.16.0, anyio-4.8.0
collected 7 items
tests/test_postgres_connector.py ....... [100%]
--------- generated xml file: /app/target/test_postgres_connector.xml ----------
============================== 7 passed in 0.20s ===============================
============================= test session starts ==============================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /app
plugins: Faker-36.1.1, time-machine-2.16.0, anyio-4.8.0
collected 9 items
tests/test_database_config.py ......... [100%]
----------- generated xml file: /app/target/test_database_config.xml -----------
============================== 9 passed in 0.02s ===============================
============================= test session starts ==============================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /app
plugins: Faker-36.1.1, time-machine-2.16.0, anyio-4.8.0
collected 14 items
tests/test_database_pytest.py .............. [100%]
----------- generated xml file: /app/target/test_database_pytest.xml -----------
============================== 14 passed in 0.19s ==============================
============================= test session starts ==============================
platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0
rootdir: /app
plugins: Faker-36.1.1, time-machine-2.16.0, anyio-4.8.0
collected 8 items
tests/test_elasticsearch_connector.py ........ [100%]
------- generated xml file: /app/target/test_elasticsearch_connector.xml -------
============================== 8 passed in 0.18s ===============================app-py3.12 {seilylook} ๐ make start
=========================
Starting the application...
=========================
python -m src.main
2025-02-28 17:34:59,269 - root - INFO - ๋ฐ์ดํฐ ์์ฑ ๋ฐ ์ ์ฅ ํ๋ก์ธ์ค ์์
2025-02-28 17:34:59,269 - root - INFO - ๋ฐ์ดํฐ์
์ด ์ด๋ฏธ ์กด์ฌํฉ๋๋ค: data/raw/test_data.csv
2025-02-28 17:34:59,285 - src.utils.connection - INFO - PostgreSQL ์ฐ๊ฒฐ ์ฑ๊ณต!
2025-02-28 17:34:59,286 - src.utils.connection - INFO - Elasticsearch ํด๋ผ์ด์ธํธ ์์ฑ ์๋ฃ
2025-02-28 17:34:59,290 - elastic_transport.transport - INFO - GET http://localhost:9200/ [status:200 duration:0.004s]
2025-02-28 17:34:59,290 - src.utils.connection - INFO - Elasticsearch ์ฐ๊ฒฐ ์ฑ๊ณต! ๋ฒ์ : 8.17.2
2025-02-28 17:34:59,290 - root - INFO - Postgresql ์ํ: ์ฐ๊ฒฐ๋จ
2025-02-28 17:34:59,290 - root - INFO - Elasticsearch ์ํ: ์ฐ๊ฒฐ๋จ
2025-02-28 17:34:59,295 - root - INFO - PostgreSQL: 1000๊ฐ ๋ ์ฝ๋๋ฅผ data/raw/test_data.csv์์ ์ฝ์์ต๋๋ค
2025-02-28 17:34:59,330 - src.database.repository - INFO - Bulk inserted 1000 users
2025-02-28 17:34:59,331 - root - INFO - PostgreSQL์ 1000๊ฐ ๋ ์ฝ๋ ์ ์ฅ ์๋ฃ
2025-02-28 17:34:59,331 - root - INFO - PostgreSQL์ 1000๊ฐ ๋ ์ฝ๋ ์ ์ฅ๋จ
2025-02-28 17:34:59,333 - root - INFO - Elasticsearch: 1000๊ฐ ๋ ์ฝ๋๋ฅผ data/raw/test_data.csv์์ ์ฝ์์ต๋๋ค
2025-02-28 17:34:59,335 - src.utils.connection - INFO - Elasticsearch ํด๋ผ์ด์ธํธ ์์ฑ ์๋ฃ
2025-02-28 17:34:59,408 - elastic_transport.transport - INFO - PUT http://localhost:9200/_bulk?refresh=true [status:200 duration:0.068s]
2025-02-28 17:34:59,410 - src.database.repository - INFO - Elasticsearch์ 1000๊ฐ ๋ฌธ์ ๋ฒํฌ ์ ์ฅ ์๋ฃ
2025-02-28 17:34:59,410 - root - INFO - Elasticsearch์ 1000๊ฐ ๋ ์ฝ๋ ์ ์ฅ ์๋ฃ
2025-02-28 17:34:59,410 - root - INFO - Elasticsearch์ 1000๊ฐ ๋ ์ฝ๋ ์ ์ฅ๋จ
2025-02-28 17:34:59,414 - root - INFO - PostgreSQL์์ 5๊ฐ ๋ ์ฝ๋ ์กฐํ ์๋ฃ
2025-02-28 17:34:59,414 - root - INFO - PostgreSQL ๋ฐ์ดํฐ ํ์ธ (์ํ 5๊ฐ):
2025-02-28 17:34:59,414 - root - INFO - ๋ ์ฝ๋ 1: {'id': 1, 'name': 'Whitney Olson', 'street': '1791 Pittman Overpass', 'city': 'Lake Jason', 'zip': '48870', 'lng': Decimal('114.735089'), 'lat': Decimal('45.235433')}
2025-02-28 17:34:59,414 - root - INFO - ๋ ์ฝ๋ 2: {'id': 2, 'name': 'David Smith', 'street': '0474 Julian Station', 'city': 'West Sophia', 'zip': '72976', 'lng': Decimal('94.204753'), 'lat': Decimal('-88.761862')}
2025-02-28 17:34:59,414 - root - INFO - ๋ ์ฝ๋ 3: {'id': 3, 'name': 'Mr. Jason Hughes MD', 'street': '7351 Robinson Underpass', 'city': 'Stephaniebury', 'zip': '8702', 'lng': Decimal('-87.282108'), 'lat': Decimal('12.763472')}
2025-02-28 17:34:59,414 - root - INFO - ๋ ์ฝ๋ 4: {'id': 4, 'name': 'John Johnson', 'street': '8304 Cooper Mews', 'city': 'Candicefort', 'zip': '87821', 'lng': Decimal('-169.562279'), 'lat': Decimal('-53.845951')}
2025-02-28 17:34:59,414 - root - INFO - ๋ ์ฝ๋ 5: {'id': 5, 'name': 'Gregory Harrison', 'street': '0866 Lee Expressway Suite 888', 'city': 'Dianaport', 'zip': '14219', 'lng': Decimal('-30.874919'), 'lat': Decimal('84.261251')}
2025-02-28 17:34:59,414 - src.utils.connection - INFO - Elasticsearch ํด๋ผ์ด์ธํธ ์์ฑ ์๋ฃ
2025-02-28 17:34:59,419 - elastic_transport.transport - INFO - POST http://localhost:9200/users/_search [status:200 duration:0.005s]
2025-02-28 17:34:59,420 - root - INFO - Elasticsearch์์ 5๊ฐ ๋ ์ฝ๋ ์กฐํ ์๋ฃ
2025-02-28 17:34:59,420 - root - INFO - Elasticsearch ๋ฐ์ดํฐ ํ์ธ (์ํ 5๊ฐ):
2025-02-28 17:34:59,420 - root - INFO - ๋ ์ฝ๋ 1: {'name': 'Whitney Olson', 'age': 26, 'street': '1791 Pittman Overpass', 'city': 'Lake Jason', 'state': 'Idaho', 'zip': 48870, 'lng': 114.735089, 'lat': 45.2354325}
2025-02-28 17:34:59,420 - root - INFO - ๋ ์ฝ๋ 2: {'name': 'David Smith', 'age': 28, 'street': '0474 Julian Station', 'city': 'West Sophia', 'state': 'Arizona', 'zip': 72976, 'lng': 94.204753, 'lat': -88.761862}
2025-02-28 17:34:59,420 - root - INFO - ๋ ์ฝ๋ 3: {'name': 'Mr. Jason Hughes MD', 'age': 70, 'street': '7351 Robinson Underpass', 'city': 'Stephaniebury', 'state': 'Mississippi', 'zip': 8702, 'lng': -87.282108, 'lat': 12.763472}
2025-02-28 17:34:59,420 - root - INFO - ๋ ์ฝ๋ 4: {'name': 'John Johnson', 'age': 41, 'street': '8304 Cooper Mews', 'city': 'Candicefort', 'state': 'Rhode Island', 'zip': 87821, 'lng': -169.562279, 'lat': -53.845951}
2025-02-28 17:34:59,420 - root - INFO - ๋ ์ฝ๋ 5: {'name': 'Gregory Harrison', 'age': 24, 'street': '0866 Lee Expressway Suite 888', 'city': 'Dianaport', 'state': 'New Jersey', 'zip': 14219, 'lng': -30.874919, 'lat': 84.261251}
2025-02-28 17:34:59,420 - root - INFO - ๋ฐ์ดํฐ ์์ฑ ๋ฐ ์ ์ฅ ํ๋ก์ธ์ค ์๋ฃ์์ make start๋ฅผ ํตํด Faker๋ฅผ ํ์ฉํด ํ
์คํธ ๋ฐ์ดํฐ๋ฅผ ์์ฑํ๋ค. ์ด ๋ฐ์ดํฐ๋ ๋ค์ ๋๋ ํ ๋ฆฌ์ ์กด์ฌํ๋ค.
-
local: /app/data/raw/test_data.csv
-
NiFi container: /opt/nifi/nifi-current/data/raw/test_data.csv
# Nifi container ์คํ
app-py3.12 {seilylook} ๐ docker exec -i -t nifi /bin/bash
# Container์ ์๋ณธ ๋ฐ์ดํฐ๊ฐ ์กด์ฌํ๋์ง ํ์ธ
nifi@e92527995ead:/opt/nifi/nifi-current$ ls data/raw/
test_data.csv
# 40์ธ ์ด์ ์ฌ๋๋ค Query ํ๊ณ Name ๊ธฐ์ค์ผ๋ก ์ ์ฅ
nifi@e92527995ead:/opt/nifi/nifi-current/data/processed$ ls -al
total 84
drwxr-xr-x 12 nifi nifi 384 Mar 7 07:17 .
drwxr-xr-x 4 root root 4096 Mar 7 07:05 ..
-rw-r--r-- 1 nifi nifi 5637 Mar 7 07:17 'Amber Taylor'
-rw-r--r-- 1 nifi nifi 5284 Mar 7 07:17 'Charles Arnold'
-rw-r--r-- 1 nifi nifi 5789 Mar 7 07:17 'Corey Hardin'
-rw-r--r-- 1 nifi nifi 6580 Mar 7 07:17 'Ebony Miller'
-rw-r--r-- 1 nifi nifi 6030 Mar 7 07:17 'Grant Garrison'
-rw-r--r-- 1 nifi nifi 5108 Mar 7 07:17 'Kristina Parker'
-rw-r--r-- 1 nifi nifi 5444 Mar 7 07:17 'Nicholas Baker MD'
-rw-r--r-- 1 nifi nifi 5277 Mar 7 07:17 'Phillip Love'
-rw-r--r-- 1 nifi nifi 6180 Mar 7 07:17 'Whitney Barnes'
-rw-r--r-- 1 nifi nifi 5438 Mar 7 07:17 'Zachary Cohen'์ฑ
63p๋ฅผ ์ฐธ์กฐํด์ Processors์ ๊ฐ Processors๋ค์ Properties๋ฅผ ์ค์ ํ๋ค. ์๋ณธ ๋ฐ์ดํฐ๋ Row๊ฐ 1000๊ฐ์ด๋ค. SplitRecord Processor์ Records Per Split๋ฅผ 100์ผ๋ก ์ค์ ํ๊ธฐ ๋๋ฌธ์ /opt/nifi/nifi-current/data/processed ์ ์๋ ๊ฒฐ๊ณผ ํ์ผ๋ค์ ๋ณด๋ฉด 10๊ฐ์์ ํ์ธํ ์ ์๋ค.
- Nifi container ์์ฑ
์ด๊ธฐ์ apache/nifi:latest ๋ฒ์ ์ผ๋ก image pull์ ์ํํ๋ version 2๋ถํฐ ์๋์ผ๋ก https:๋ก ์ฐ๊ฒฐ๋๋๋ก ์ ์๋์ด ์์๋ค. ๊ทธ๋์ ์ด์ ๊น์ง Nifi๋ฅผ ๋นผ๊ณ ๋๋จธ์ง(postgresql, elasticsearch, airflow)๋ง์ docker container๋ก ๋ง๋ค์์ผ๋ ๋ฌธ์ ๋ฅผ ํด๊ฒฐํ๊ณ ์ถ์ด ์ฌ๋ฌ๊ฐ์ง ์๋๋ฅผ ํ๋ค๊ฐ ๋ง์ง๋ง ์๋จ์ผ๋ก version์ 1.28.0(๋ฒ์ ์ ๋ฎ์ถ ๋ ์ฃผ์ํ ์ ์ OS/ARCH == linux/arm64 ๋ฅผ ์ง์ํ๋ docker image ์ธ์ง ํ์ธํด์ผ ํ๋ค. ๋ด MAC์ arm64์ด๊ธฐ ๋๋ฌธ์ด๋ค.) ์ผ๋ก ๋ฎ์ถ๋ ์ ์์ ์ผ๋ก http: port๋ฅผ ์ฌ์ฉํด์ ์ ๊ทผํ ์ ์์๋ค.
- Nifi Processor
Nifi Processor๋ฅผ ์์ฑํ๊ณ Properties๋ฅผ ์ค์ ํ๋ ๊ฒ์ ์ฑ
๊ณผ ๋์ผํด ํฐ ๋ฌธ์ ๊ฐ ๋ฐ์ํ์ง ์์๋ค. ๊ทธ๋ฐ๋ฐ ๊ณ์ํด์ SplitRecord ๋ถ๋ถ์์ ๋ฌธ์ ๊ฐ ๋ฐ์ํ๋๋ฐ, ์์ธ์ RELATIONS ์ค์ . ์ฆ, ๊ฐ๊ฐ์ Processor๋ ์ํ๋ ์ฐ๊ฒฐ ex, success, splits, over . 40, matched ๋ฟ๋ง ์๋๋ผ failure, unmatched ๋ฑ ์์์น ๋ชปํ ์ํฉ์์ ๋ํด์ terminate | retry ๋ฅผ ์ค์ ํด์ฃผ์ด์ผ ํ๋ค. ์ฝ๊ฒ ์๊ฐํ๋ฉด ์๋จ์ !(Warning)์ด ํ๋๋ ์์ด์ผ ํ๋ค.
make build๋ฅผ ์คํํ๋ฉด Makefile์์ ์์ฑํ ๋๋ก init_elasticsearch.sh, init_postgresql.sh๋ฅผ ์คํ ์ํจ๋ค.
- init_elasticsearch.sh
#!/bin/bash
# Elasticsearch ์ปจํ
์ด๋๊ฐ ์คํ ์ค์ธ์ง ํ์ธ
if ! docker ps | grep -q "elasticsearch"; then
echo "Elasticsearch ์ปจํ
์ด๋๊ฐ ์คํ๋๊ณ ์์ง ์์ต๋๋ค."
exit 1
fi
# Elasticsearch๊ฐ ์ค๋น๋ ๋๊น์ง ๋๊ธฐ
echo "Elasticsearch๊ฐ ์ค๋น๋ ๋๊น์ง ๋๊ธฐ ์ค..."
until $(curl --silent --output /dev/null --fail --max-time 5 http://localhost:9200); do
printf '.'
sleep 5
done
echo "Elasticsearch๊ฐ ์ค๋น๋์์ต๋๋ค."
# users ์ธ๋ฑ์ค ์ค์
echo "users ์ธ๋ฑ์ค ์์ฑ ์ค..."
curl -X PUT "localhost:9200/users" -H 'Content-Type: application/json' -d'
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"properties": {
"name": { "type": "text" },
"street": { "type": "text" },
"city": { "type": "text" },
"zip": { "type": "keyword" },
"lng": { "type": "double" },
"lat": { "type": "double" }
}
}
}
'
# ๊ฒฐ๊ณผ ํ์ธ
if [ $? -eq 0 ]; then
echo "users ์ธ๋ฑ์ค๊ฐ ์ฑ๊ณต์ ์ผ๋ก ์์ฑ๋์์ต๋๋ค."
else
echo "users ์ธ๋ฑ์ค ์์ฑ ์ค ์ค๋ฅ๊ฐ ๋ฐ์ํ์ต๋๋ค."
exit 1
fi
# ์ธ๋ฑ์ค ํ์ธ
echo "์ธ๋ฑ์ค ๋ชฉ๋ก ํ์ธ:"
curl -X GET "localhost:9200/_cat/indices?v"- init_postgresql.sh
#!/bin/bash
# PostgreSQL ์ปจํ
์ด๋๊ฐ ์คํ ์ค์ธ์ง ํ์ธ
if ! docker ps | grep -q "postgres"; then
echo "PostgreSQL ์ปจํ
์ด๋๊ฐ ์คํ๋๊ณ ์์ง ์์ต๋๋ค."
exit 1
fi
# ๋จผ์ ๋ฐ์ดํฐ๋ฒ ์ด์ค ์กด์ฌ ์ฌ๋ถ ํ์ธ ๋ฐ ์์ฑ
echo "dataengineering ๋ฐ์ดํฐ๋ฒ ์ด์ค ์์ฑ ์ค..."
docker exec -i postgres bash -c 'PGPASSWORD=airflow psql -U airflow -d airflow -c "CREATE DATABASE dataengineering;"'
# ์ด์ dataengineering ๋ฐ์ดํฐ๋ฒ ์ด์ค์ ํ
์ด๋ธ ์์ฑ
cat << 'EOF' > /tmp/create_tables.sql
-- users ํ
์ด๋ธ์ด ์กด์ฌํ์ง ์์ผ๋ฉด ์์ฑ
CREATE TABLE IF NOT EXISTS users (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
street VARCHAR(200),
city VARCHAR(100),
zip VARCHAR(10),
lng DECIMAL(10, 6),
lat DECIMAL(10, 6)
);
-- ํ
์ด๋ธ์ด ์ฑ๊ณต์ ์ผ๋ก ์์ฑ๋์๋์ง ํ์ธ
\dt
-- ํ
์ด๋ธ ๊ตฌ์กฐ ํ์ธ
\d users
EOF
# PostgreSQL ์ปจํ
์ด๋์ SQL ํ์ผ ๋ณต์ฌ
docker cp /tmp/create_tables.sql postgres:/tmp/create_tables.sql
# PostgreSQL ์ปจํ
์ด๋ ๋ด์์ SQL ์คํฌ๋ฆฝํธ ์คํ (dataengineering ๋ฐ์ดํฐ๋ฒ ์ด์ค์ ์ง์ ์ฐ๊ฒฐ)
echo "ํ
์ด๋ธ ์์ฑ ์ค..."
docker exec -i postgres bash -c 'PGPASSWORD=airflow psql -U airflow -d dataengineering -f /tmp/create_tables.sql'
# ๊ถํ ๋ถ์ฌ
echo "๊ถํ ๋ถ์ฌ ์ค..."
docker exec -i postgres bash -c 'PGPASSWORD=airflow psql -U airflow -d dataengineering -c "GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO airflow; GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO airflow;"'
# ์คํ ๊ฒฐ๊ณผ ํ์ธ
if [ $? -eq 0 ]; then
echo "๋ฐ์ดํฐ๋ฒ ์ด์ค์ ํ
์ด๋ธ์ด ์ฑ๊ณต์ ์ผ๋ก ์์ฑ๋์์ต๋๋ค."
# ์์ ํ์ผ ์ญ์
rm /tmp/create_tables.sql
docker exec postgres rm /tmp/create_tables.sql
else
echo "์ค๋ฅ๊ฐ ๋ฐ์ํ์ต๋๋ค."
# ์์ ํ์ผ ์ญ์
rm /tmp/create_tables.sql
docker exec postgres rm /tmp/create_tables.sql
exit 1
fi
# ๋ฐ์ดํฐ๋ฒ ์ด์ค ์ฐ๊ฒฐ ํ
์คํธ
echo "๋ฐ์ดํฐ๋ฒ ์ด์ค ์ฐ๊ฒฐ ํ
์คํธ ์ค..."
docker exec postgres pg_isready -U airflow -d dataengineering
if [ $? -eq 0 ]; then
echo "๋ฐ์ดํฐ๋ฒ ์ด์ค๊ฐ ์ ์์ ์ผ๋ก ์๋ตํฉ๋๋ค."
else
echo "๋ฐ์ดํฐ๋ฒ ์ด์ค ์ฐ๊ฒฐ ํ
์คํธ์ ์คํจํ์ต๋๋ค."
exit 1
fi์ด๋ฅผ ํตํด ๊ธฐ๋ณธ ๋ฐ์ดํฐ๋ฅผ ์์ฑํ๊ณ ๋๊ฐ์ ๋ฐ์ดํฐ๋ฅผ PostgresSQL, ElasticSearch์ ๊ฐ๊ฐ ์ ์ฅํ๋ค. ์๋์ Log์ ํตํด ์ ์์ ์ผ๋ก ์ ์ฅ๋์์์ ํ์ธํ ์ ์๋ค.
==============================================
Waiting for PostgreSQL to start...
==============================================
=====================================
Initializing PostgreSQL...
=====================================
chmod +x ./scripts/init_postgresql.sh
./scripts/init_postgresql.sh
dataengineering ๋ฐ์ดํฐ๋ฒ ์ด์ค ์์ฑ ์ค...
ERROR: database "dataengineering" already exists
Successfully copied 2.05kB to postgres:/tmp/create_tables.sql
ํ
์ด๋ธ ์์ฑ ์ค...
psql:/tmp/create_tables.sql:10: NOTICE: relation "users" already exists, skipping
CREATE TABLE
List of relations
Schema | Name | Type | Owner
--------+-------+-------+---------
public | users | table | airflow
(1 row)
Table "public.users"
Column | Type | Collation | Nullable | Default
--------+------------------------+-----------+----------+-----------------------------------
id | integer | | not null | nextval('users_id_seq'::regclass)
name | character varying(100) | | not null |
street | character varying(200) | | |
city | character varying(100) | | |
zip | character varying(10) | | |
lng | numeric(10,6) | | |
lat | numeric(10,6) | | |
Indexes:
"users_pkey" PRIMARY KEY, btree (id)
๊ถํ ๋ถ์ฌ ์ค...
GRANT
๋ฐ์ดํฐ๋ฒ ์ด์ค์ ํ
์ด๋ธ์ด ์ฑ๊ณต์ ์ผ๋ก ์์ฑ๋์์ต๋๋ค.
๋ฐ์ดํฐ๋ฒ ์ด์ค ์ฐ๊ฒฐ ํ
์คํธ ์ค...
/var/run/postgresql:5432 - accepting connections
๋ฐ์ดํฐ๋ฒ ์ด์ค๊ฐ ์ ์์ ์ผ๋ก ์๋ตํฉ๋๋ค.
==============================================
Waiting for Elasticsearch to start...
==============================================
=====================================
Initializing Elasticsearch...
=====================================
chmod +x ./scripts/init_elasticsearch.sh
./scripts/init_elasticsearch.sh
Elasticsearch๊ฐ ์ค๋น๋ ๋๊น์ง ๋๊ธฐ ์ค...
Elasticsearch๊ฐ ์ค๋น๋์์ต๋๋ค.
users ์ธ๋ฑ์ค ์์ฑ ์ค...
{"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [users/AW57UPmxTYyW3G-GdL6lHw] already exists","index_uuid":"AW57UPmxTYyW3G-GdL6lHw","index":"users"}],"type":"resource_already_exists_exception","reason":"index [users/AW57UPmxTYyW3G-GdL6lHw] already exists","index_uuid":"AW57UPmxTYyW3G-GdL6lHw","index":"users"},"status":400}users ์ธ๋ฑ์ค๊ฐ ์ฑ๊ณต์ ์ผ๋ก ์์ฑ๋์์ต๋๋ค.
์ธ๋ฑ์ค ๋ชฉ๋ก ํ์ธ:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green open .internal.alerts-transform.health.alerts-default-000001 MSc-VAFyQHG9tGk2TxwbGg 1 0 0 0 249b 249b 249b
green open .internal.alerts-observability.logs.alerts-default-000001 Bief08evQ_SX_3UrwZOzwQ 1 0 0 0 249b 249b 249b
green open .internal.alerts-observability.uptime.alerts-default-000001 N-ptFAoNTYeF7OHGtOqZFw 1 0 0 0 249b 249b 249b
green open .internal.alerts-ml.anomaly-detection.alerts-default-000001 JL9JINenTS-XMJClxKXrYA 1 0 0 0 249b 249b 249b
green open .internal.alerts-observability.slo.alerts-default-000001 rFwL_h62QZmQx8kkXFZIyA 1 0 0 0 249b 249b 249b
green open .internal.alerts-default.alerts-default-000001 6xFM1NqvTc--9BJ7MlhCpA 1 0 0 0 249b 249b 249b
green open .internal.alerts-observability.apm.alerts-default-000001 D1xM5G2DTPeF25kkR6_KWQ 1 0 0 0 249b 249b 249b
green open users AW57UPmxTYyW3G-GdL6lHw 1 0 1000 0 229.1kb 229.1kb 229.1kb
green open .internal.alerts-observability.metrics.alerts-default-000001 IO4Li-8sS6-AH8aSsvEqLg 1 0 0 0 249b 249b 249b
green open .internal.alerts-ml.anomaly-detection-health.alerts-default-000001 95sA140IQ2qCyErtmHKTEg 1 0 0 0 249b 249b 249b
green open .internal.alerts-observability.threshold.alerts-default-000001 TV3IWC9fQUuYWQO-56_5sA 1 0 0 0 249b 249b 249b
green open .internal.alerts-security.alerts-default-000001 VE-qcLiURZy7h_i1hb96Lw 1 0 0 0 249b 249b 249b
green open .internal.alerts-stack.alerts-default-000001 _PEhTLb4TJ2bYAwiv6kz8w 1 0 0 0 249b 249b 249bPostgreSQL UI๋ฅผ ์ํ pgAdmin์ ๋ฐ๋ก ์ค์นํ์ง ์์๊ธฐ ๋๋ฌธ์ psql์ ํตํด ํ์ธํ๋ ๊ฒ์ผ๋ก ํ๊ณ ElasticSearch์ ๋ฐ์ดํฐ๊ฐ ์ ์ ํ ์ ์ฅ๋์๋์ง ํ์ธํ๊ธฐ ์ํด ์์ ์ค์นํ Kibana๋ฅผ ์ด์ฉํ๋ค. init_elasticsearch.sh์์ users index๋ฅผ ์์ฑํ๊ณ ์ด๊ธฐ ํ
์คํธ ๋ฐ์ดํฐ๋ฅผ ์ ์ฅํด์ฃผ์๋ค.
http://localhost:5601์ ์ ๊ทผํ๊ณ , ๋ค์์ ๊ณผ์ ์ ํตํด์ ๋ฐ์ดํฐ๊ฐ ์ ์์ ์ผ๋ก ElasticSearch์ ์ ์ฅ๋์๋์ง ํ์ธํด์ค๋ค.
- ์ผ์ชฝ Toolbar์์ Analytics -> Discover ํด๋ฆญ
- Create index pattern ํด๋ฆญ
- ์ค๋ฅธ์ชฝ ํ๋ฉด์ ๋ณด๋ฉด ์์ init_elasticsearch.sh๋ฅผ ํตํด ์์ฑํ users index๊ฐ ๋ณด์ธ๋ค. ๊ทธ๋ ๊ธฐ์ ์ผ์ชฝ์ Name์ uses๋ฅผ ๋ฃ์ด์ค๋ค.
- index๊ฐ ์ฐ๊ฒฐ๋์๊ธฐ์ field type๋ค์ ํ์ธํ ์ ์๋ค.
- ์ผ์ชฝ Toolbardptj Discover ํด๋ฆญ. ์ ์ฅ๋ ๊ฐ๋ค์ ํ์ธํ ์ ์๋ค.
- DAG ์์ฑ
import os
import datetime as dt
from datetime import timedelta
import logging
import pandas as pd
import psycopg2
from psycopg2.extras import DictCursor
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.exceptions import AirflowException
from elasticsearch import Elasticsearch
# ๋ก๊น
์ค์
logger = logging.getLogger(__name__)
# ๋ชจ๋ ๊ฐ์ ธ์ค๊ธฐ ์๋ - ์ํฌํธ ๊ฒฝ๋ก ๋ฌธ์ ํด๊ฒฐ
import sys
import importlib.util
import inspect
# ํ์ฌ ํ์ผ์ ์ ๋ ๊ฒฝ๋ก
current_file = inspect.getfile(inspect.currentframe())
# ํ์ฌ ๋๋ ํ ๋ฆฌ (airflow/dags)
current_dir = os.path.dirname(os.path.abspath(current_file))
# Airflow ํ ๋๋ ํ ๋ฆฌ (/opt/airflow)
airflow_home = os.path.dirname(os.path.dirname(current_dir))
# app ๋๋ ํ ๋ฆฌ ๊ฒฝ๋ก
app_dir = os.path.join(airflow_home, "app")
# Python ๊ฒฝ๋ก์ app ๋๋ ํ ๋ฆฌ ์ถ๊ฐ
if app_dir not in sys.path:
sys.path.insert(0, app_dir)
logger.info(f"Python ๊ฒฝ๋ก์ ์ถ๊ฐ๋จ: {app_dir}")
# ๊ฒฝ๋ก ๋๋ฒ๊น
logger.info(f"Python ๊ฒฝ๋ก: {sys.path}")
logger.info(f"ํ์ฌ ๋๋ ํ ๋ฆฌ: {os.getcwd()}")
logger.info(f"ํ์ฌ ํ์ผ: {current_file}")
logger.info(f"app ๋๋ ํ ๋ฆฌ ๊ฒฝ๋ก: {app_dir}")
# ๋ชจ๋ ๊ฐ์ ธ์ค๊ธฐ ์๋
USE_REPOSITORY_PATTERN = False
try:
from src.database.repository import RepositoryFactory
from src.config.database import PostgresConfig, ElasticsearchConfig
from src.utils.connection import PostgresConnector, ElasticsearchConnector
# ์ฑ๊ณต์ ์ผ๋ก ๊ฐ์ ธ์๋์ง ํ์ธ
if all(
[
RepositoryFactory,
PostgresConfig,
ElasticsearchConfig,
PostgresConnector,
ElasticsearchConnector,
]
):
USE_REPOSITORY_PATTERN = True
logger.info("Repository ๋ชจ๋ ๊ฐ์ ธ์ค๊ธฐ ์ฑ๊ณต")
except ImportError as e:
logger.error(f"Repository ๋ชจ๋ ๊ฐ์ ธ์ค๊ธฐ ์คํจ: {e}")
logger.info("์ง์ ๊ตฌํ์ผ๋ก ๋์ฒดํฉ๋๋ค")
# DAG ์ค์
default_args = {
"owner": "Se Hyeon Kim",
"start_date": dt.datetime(2025, 3, 1),
"retries": 3,
"retry_delay": dt.timedelta(minutes=5),
"email_on_failure": True,
"email_on_retry": False,
}
# PostgreSQL ์ฐ๊ฒฐ ์ ๋ณด
PG_HOST = "postgres" # Docker Compose์ ์๋น์ค ์ด๋ฆ
PG_PORT = 5432
PG_DATABASE = "airflow"
PG_USER = "airflow"
PG_PASSWORD = "airflow"
# Elasticsearch ์ฐ๊ฒฐ ์ ๋ณด
ES_HOST = "elasticsearch" # Docker Compose์ ์๋น์ค ์ด๋ฆ
ES_PORT = 9200
def extract_from_postgresql(**context):
"""PostgreSQL์์ ์ฌ์ฉ์ ๋ฐ์ดํฐ๋ฅผ ์ถ์ถํ์ฌ CSV ํ์ผ๋ก ์ ์ฅ"""
global USE_REPOSITORY_PATTERN # ์ ์ญ ๋ณ์๋ก ์ ์ธ
try:
if USE_REPOSITORY_PATTERN:
# Repository ํจํด์ ์ฌ์ฉํ์ฌ ๋ฐ์ดํฐ ์ ๊ทผ
try:
logger.info("Repository ํจํด์ผ๋ก PostgreSQL ์ ๊ทผ ์๋")
postgres_config = PostgresConfig()
postgres_connector = PostgresConnector(config=postgres_config)
repository = RepositoryFactory.create(
"postgresql", connector=postgres_connector
)
# ์ฐ๊ฒฐ ํ์ธ
if not repository.check_connection():
raise Exception("PostgreSQL ์ฐ๊ฒฐ์ ์คํจํ์ต๋๋ค.")
# ๋ฐ์ดํฐ ์กฐํ
users = repository.get_all(limit=1000)
if not users:
logger.warning("PostgreSQL์์ ์กฐํ๋ ์ฌ์ฉ์ ์์")
return False
# DataFrame์ผ๋ก ๋ณํ
df = pd.DataFrame(users)
logger.info(
f"Repository ํจํด์ผ๋ก {len(df)}๋ช
์ ์ฌ์ฉ์ ๋ฐ์ดํฐ ์ถ์ถ ์๋ฃ"
)
except Exception as repo_error:
logger.error(f"Repository ํจํด ์ฌ์ฉ ์คํจ: {repo_error}")
logger.info("์ง์ DB ์ฐ๊ฒฐ๋ก ๋์ฒดํฉ๋๋ค")
# Repository ํจํด ์คํจ ์ ์ง์ ์ฐ๊ฒฐ๋ก ๋์ฒด
USE_REPOSITORY_PATTERN = False
raise repo_error
if not USE_REPOSITORY_PATTERN:
# ์ง์ psycopg2๋ก PostgreSQL์ ์ฐ๊ฒฐ
conn_string = f"host={PG_HOST} port={PG_PORT} dbname={PG_DATABASE} user={PG_USER} password={PG_PASSWORD}"
logger.info(f"PostgreSQL ์ง์ ์ฐ๊ฒฐ: {conn_string}")
conn = psycopg2.connect(conn_string)
with conn.cursor(cursor_factory=DictCursor) as cur:
# users ํ
์ด๋ธ์ด ์์ผ๋ฏ๋ก ๋ค๋ฅธ ํ
์ด๋ธ ์ฌ์ฉ (์: dag_run)
cur.execute("SELECT * FROM dag_run LIMIT 100")
rows = cur.fetchall()
if not rows:
logger.warning("PostgreSQL์์ ์กฐํ๋ ๋ฐ์ดํฐ ์์")
return False
# ๋ฐ์ดํฐ๋ฅผ ๋์
๋๋ฆฌ ๋ฆฌ์คํธ๋ก ๋ณํ
data = [dict(row) for row in rows]
df = pd.DataFrame(data)
conn.close()
# CSV ์ ์ฅ
output_path = "/tmp/postgresql_users.csv"
df.to_csv(output_path, index=False)
context["ti"].xcom_push(key="csv_path", value=output_path)
logger.info(f"PostgreSQL์์ {len(df)}๊ฐ์ ๋ฐ์ดํฐ ์ถ์ถ ์๋ฃ")
return True
except Exception as e:
logger.error(f"PostgreSQL ๋ฐ์ดํฐ ์ถ์ถ ์คํจ: {e}")
# ํ
์คํธ ๋ฐ์ดํฐ ์์ฑ (์ค์ DB ์ฐ๊ฒฐ์ด ์๋ ๊ฒฝ์ฐ๋ฅผ ์ํ ๋ฐฑ์
)
try:
logger.info("ํ
์คํธ ๋ฐ์ดํฐ๋ฅผ ์์ฑํฉ๋๋ค")
test_data = [
{
"id": 1,
"name": "User 1",
"street": "Street 1",
"city": "City 1",
"zip": "11111",
"lng": "1.1",
"lat": "1.1",
},
{
"id": 2,
"name": "User 2",
"street": "Street 2",
"city": "City 2",
"zip": "22222",
"lng": "2.2",
"lat": "2.2",
},
{
"id": 3,
"name": "User 3",
"street": "Street 3",
"city": "City 3",
"zip": "33333",
"lng": "3.3",
"lat": "3.3",
},
]
df = pd.DataFrame(test_data)
output_path = "/tmp/postgresql_users.csv"
df.to_csv(output_path, index=False)
context["ti"].xcom_push(key="csv_path", value=output_path)
logger.info("ํ
์คํธ ๋ฐ์ดํฐ ์์ฑ ์๋ฃ")
return True
except Exception as test_error:
logger.error(f"ํ
์คํธ ๋ฐ์ดํฐ ์์ฑ ์คํจ: {test_error}")
raise AirflowException(f"๋ฐ์ดํฐ ์ถ์ถ ์ค๋ฅ: {str(e)}")
def load_to_elasticsearch(**context):
"""CSV ํ์ผ ๋ฐ์ดํฐ๋ฅผ Elasticsearch์ ์ ์ฌ"""
global USE_REPOSITORY_PATTERN # ์ ์ญ ๋ณ์๋ก ์ ์ธ
try:
# CSV ํ์ผ ๊ฒฝ๋ก ๊ฐ์ ธ์ค๊ธฐ
ti = context["ti"]
csv_path = ti.xcom_pull(task_ids="extract_postgresql_data", key="csv_path")
if not csv_path:
raise AirflowException(
"์ด์ ํ์คํฌ์์ CSV ํ์ผ ๊ฒฝ๋ก๋ฅผ ๊ฐ์ ธ์ฌ ์ ์์ต๋๋ค."
)
# CSV ํ์ผ ์ฝ๊ธฐ
df = pd.read_csv(csv_path)
if df.empty:
logger.warning("์ ์ฌํ ์ฌ์ฉ์ ๋ฐ์ดํฐ๊ฐ ์์ต๋๋ค")
return False
if USE_REPOSITORY_PATTERN:
# Repository ํจํด์ ์ฌ์ฉํ์ฌ Elasticsearch์ ์ ์ฌ
try:
logger.info("Repository ํจํด์ผ๋ก Elasticsearch ์ ๊ทผ ์๋")
es_config = ElasticsearchConfig()
es_connector = ElasticsearchConnector(config=es_config)
repository = RepositoryFactory.create(
"elasticsearch",
connector=es_connector,
index="users_from_postgresql",
)
# ์ฐ๊ฒฐ ํ์ธ
if not repository.check_connection():
raise Exception("Elasticsearch ์ฐ๊ฒฐ์ ์คํจํ์ต๋๋ค.")
# ๋ฐ์ดํฐ ๋ณํ ๋ฐ ์ ์ฌ
records = df.to_dict("records")
inserted_count = repository.bulk_save(records)
logger.info(f"Elasticsearch์ {inserted_count}๊ฐ ๋ฌธ์ ์ ์ฌ ์๋ฃ")
return True
except Exception as repo_error:
logger.error(f"Repository ํจํด ์ฌ์ฉ ์คํจ: {repo_error}")
logger.info("์ง์ Elasticsearch ์ฐ๊ฒฐ๋ก ๋์ฒดํฉ๋๋ค")
# ์คํจ ์ ์ง์ ์ฐ๊ฒฐ๋ก ๋์ฒด
USE_REPOSITORY_PATTERN = False
raise repo_error
if not USE_REPOSITORY_PATTERN:
# Elasticsearch ์ง์ ์ฐ๊ฒฐ
es_url = f"http://{ES_HOST}:{ES_PORT}"
logger.info(f"Elasticsearch ์ง์ ์ฐ๊ฒฐ: {es_url}")
es = Elasticsearch([es_url])
# ์ฐ๊ฒฐ ํ์ธ
if not es.ping():
logger.error("Elasticsearch ์ฐ๊ฒฐ ์คํจ")
logger.info(
"์์
์ ์๋ฃ๋ ๊ฒ์ผ๋ก ํ์ํฉ๋๋ค (์ค์ ๋ฐ์ดํฐ๋ ์ ์ฌ๋์ง ์์)"
)
return True # ํ
์คํธ ํ๊ฒฝ์์๋ ์ฑ๊ณต์ผ๋ก ์ฒ๋ฆฌ
# ๋ฐ์ดํฐ ๋ณํ ๋ฐ ์ ์ฌ
index_name = "users_from_postgresql"
bulk_data = []
for _, row in df.iterrows():
bulk_data.append({"index": {"_index": index_name}})
bulk_data.append(row.to_dict())
if bulk_data:
es.bulk(operations=bulk_data, refresh=True)
logger.info(f"Elasticsearch์ {len(df)}๊ฐ ๋ฌธ์ ์ ์ฌ ์๋ฃ")
return True
except Exception as e:
logger.error(f"Elasticsearch ๋ฐ์ดํฐ ์ ์ฌ ์คํจ: {e}")
# ๊ฐ๋ฐ/ํ
์คํธ ํ๊ฒฝ์์๋ ์ด ์ค๋ฅ๋ฅผ ๋ฌด์ํ๊ณ ์งํ
logger.info("Elasticsearch ์ ์ฌ์ ์คํจํ์ง๋ง ์์
์ ์๋ฃ๋ ๊ฒ์ผ๋ก ํ์ํฉ๋๋ค")
return True
# DAG ์ ์
with DAG(
dag_id="user_data_transfer",
default_args=default_args,
description="PostgreSQL ์ฌ์ฉ์ ๋ฐ์ดํฐ๋ฅผ Elasticsearch๋ก ์ ์ก",
schedule_interval=timedelta(hours=1),
catchup=False,
tags=["postgresql", "elasticsearch", "user_data"],
) as dag:
extract_task = PythonOperator(
task_id="extract_postgresql_data",
python_callable=extract_from_postgresql,
)
load_task = PythonOperator(
task_id="load_elasticsearch_data",
python_callable=load_to_elasticsearch,
)
# ํ์คํฌ ์์กด์ฑ ์ค์
extract_task >> load_task- ์คํ ๊ฒฐ๊ณผ
- extract_postgresql_data
ac7a746c901c
โถ Log message source details
[2025-03-08, 04:04:08 UTC] {local_task_job_runner.py:123} โถ Pre task execution logs
[2025-03-08, 04:04:08 UTC] {main.py:128} INFO - PostgreSQL ์ง์ ์ฐ๊ฒฐ: host=postgres port=5432 dbname=*** user=*** password=***
[2025-03-08, 04:04:08 UTC] {main.py:152} INFO - PostgreSQL์์ 1๊ฐ์ ๋ฐ์ดํฐ ์ถ์ถ ์๋ฃ
[2025-03-08, 04:04:08 UTC] {python.py:240} INFO - Done. Returned value was: True
[2025-03-08, 04:04:08 UTC] {taskinstance.py:341} โถ Post task execution logs- load_elasticsearch_data
ac7a746c901c
โถ Log message source details
[2025-03-08, 04:04:09 UTC] {local_task_job_runner.py:123} โถ Pre task execution logs
[2025-03-08, 04:04:09 UTC] {main.py:253} INFO - Elasticsearch ์ง์ ์ฐ๊ฒฐ: http://elasticsearch:9200
[2025-03-08, 04:04:09 UTC] {_transport.py:349} INFO - HEAD http://elasticsearch:9200/ [status:200 duration:0.003s]
[2025-03-08, 04:04:09 UTC] {_transport.py:349} INFO - PUT http://elasticsearch:9200/_bulk?refresh=true [status:200 duration:0.096s]
[2025-03-08, 04:04:09 UTC] {main.py:275} INFO - Elasticsearch์ 1๊ฐ ๋ฌธ์ ์ ์ฌ ์๋ฃ
[2025-03-08, 04:04:09 UTC] {python.py:240} INFO - Done. Returned value was: True
[2025-03-08, 04:04:09 UTC] {taskinstance.py:341} โถ Post task execution logsapp-py3.12 {seilylook} โ๏ธ curl -X GET "localhost:9200/_cat/indices?v"
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open .geoip_databases 7wlLmCbzT-68QQtczL1XlQ 1 0 39 0 36.2mb 36.2mb
green open .kibana_task_manager_7.17.28_001 32xC4NKOQCWKlK_Lta3KuA 1 0 17 1206 267.3kb 267.3kb
yellow open fromnifi 7wEUB-_YTI6ZpVo9nVu0Fg 1 1 450400 0 189.9mb 189.9mb
green open .kibana_7.17.28_001 lUIvEKPPQwiGzG94W_LaVA 1 0 15 0 2.3mb 2.3mb
green open .apm-custom-link guOxO9XJSdmOvvEuTyOdnA 1 0 0 0 227b 227b
green open .apm-agent-configuration 77OEbeUsSsW8bGctTzmD1A 1 0 0 0 227b 227b
green open users iI4uC4UiRPyFBgts1aAB7w 1 0 1000 0 231.9kb 231.9kbDocker ํ๊ฒฝ์์ Postgres, Elasticsearch, Nifi๋ฅผ ์คํํ๋ ์ํ์ด๊ธฐ ๋๋ฌธ์ ๋ค์๊ณผ ๊ฐ์ด ๋ค๋ฅด๊ฒ ๋ฐ๋ก ์ธํ ์ด ํ์ํ๋ค.
PostgreSQL, Elasticsearch connection ํ ๋, postgres๋ jar๋ฅผ ์ด์ฉํด์ผ ํ๋ค. PostgreSQL ๊ณต์ ํํ์ด์ง์์ ์ฌ์ฉํ๊ณ ์ ํ๋ ๋ฒ์ ์ jar์ ๋ค์ด๋ฐ์ local์ nifi/drivers์ ์ ์ฅํด์คฌ๋ค. ์ด๋ฅผ docker-compose์์ nifi container๋ฅผ ๋ง๋ค ๋ mount ์์ผ์ nifi container์์ PostgreSQL jar๋ฅผ ์ธ์ํ ์ ์๋๋ก ํด์ค๋ค.
services:
nifi:
image: apache/nifi:1.28.0
container_name: nifi
restart: always
ports:
- "9300:9300"
environment:
- NIFI_WEB_HTTP_HOST=0.0.0.0
- NIFI_WEB_HTTP_PORT=9300
- NIFI_WEB_PROXY_HOST=localhost:9300
- SINGLE_USER_CREDENTIALS_USERNAME=nifi
- SINGLE_USER_CREDENTIALS_PASSWORD=nifipassword
volumes:
- nifi-system-data:/opt/nifi/nifi-current/system-data # ๋ด๋ถ ์์คํ
๋ฐ์ดํฐ
- ./logs/nifi:/opt/nifi/nifi-current/logs
- nifi-conf:/opt/nifi/nifi-current/conf
- ./nifi/data/raw:/opt/nifi/nifi-current/data/raw # ์๋ณธ ๋ฐ์ดํฐ
- ./nifi/data/processed:/opt/nifi/nifi-current/data/processed # ์ฒ๋ฆฌ๋ ๋ฐ์ดํฐ
- ./nifi/templates:/opt/nifi/nifi-current/templates
- ./nifi/drivers:/opt/nifi/nifi-current/lib/custom-drivers์ฌ๋ฐ๋ฅด๊ฒ mount ๋์๋์ง ํ์ธํด์ค๋ค.
app-py3.12 {seilylook} โ๏ธ docker exec -i -t nifi /bin/bash
nifi@d8ffbc7cb03c:/opt/nifi/nifi-current$ cd lib/custom-drivers/
nifi@d8ffbc7cb03c:/opt/nifi/nifi-current/lib/custom-drivers$ ls
postgresql-42.7.5.jar์ด์ด์ DBCPConnectionPool ์๋น์ค๋ฅผ ๋ค์๊ณผ ๊ฐ์ด ์ค์ ํด์ค๋ค.
- Database Connection URL:
jdbc:postgresql://postgres:5432/dataengineering
- Database Driver Class Name:
org.postgresql.Driver
- Database Driver Locations:
/opt/nifi/nifi-current/lib/custom-drivers/postgresql-42.7.1.jar
- Database User:
airflow
- Password:
airflow
๋ง์ง๋ง์ผ๋ก PutElasticsearchHtttp processor์ properties์์ URL์ Port๋ฅผ ๋ค์์ผ๋ก ์ค์ ํด์ค๋ค.
NiFi ๋ฐ์ดํฐ ํ์ดํ๋ผ์ธ์์ ๋ค์๊ณผ ๊ฐ์ ์ํฌํ๋ก์ฐ๋ฅผ ๊ตฌ์ฑํ์ต๋๋ค:
- ExecuteScript (1์ฐจ): ์ธ๋ถ API์์ ๋ฐ์ดํฐ๋ฅผ ๊ฐ์ ธ์ค๋ ์คํฌ๋ฆฝํธ
- SplitJSON: ๊ฐ์ ธ์จ ๋ฐ์ดํฐ๋ฅผ ๊ฐ๋ณ ๋ ์ฝ๋๋ก ๋ถํ
- ExecuteScript (2์ฐจ): ๊ฐ ๋ ์ฝ๋์ ์ถ๊ฐ ํ๋ ์์ฑ ๋ฐ ๋ฐ์ดํฐ ๋ณํ
- EvaluateJsonPath: JSON์์ id ํ๋ ์ถ์ถ
- PutElasticsearchHTTP: Elasticsearch์ ๋ฐ์ดํฐ ์ ์ฅ
๊ทธ๋ฌ๋ ๋ง์ง๋ง PutElasticsearchHTTP ํ๋ก์ธ์์์ ๋ค์ ์ค๋ฅ๊ฐ ๋ฐ์ํ์ต๋๋ค:
Failed to process Flowfile due to failed to parse, transfering to failure๋ก๊ทธ๋ฅผ ์์ธํ ์ดํด๋ณด๋ ๋ค์๊ณผ ๊ฐ์ ๊ตฌ์ฒด์ ์ธ ์ค๋ฅ ๋ฉ์์ง๋ฅผ ํ์ธํ ์ ์์์ต๋๋ค:
Index operation upsert requires a valid identifier value from a flow file attribute, transferring to failure.๋ก๊ทธ ๋ถ์ ๊ฒฐ๊ณผ, ๋ค์ ์ฌํญ์ ํ์ธํ์ต๋๋ค:
-
์์ธ: PutElasticsearchHTTP ํ๋ก์ธ์๊ฐ 'upsert' ์์ ์ ์ํํ๊ธฐ ์ํด ํ์ํ ์๋ณ์(ID) ๊ฐ์ด ์ฌ๋ฐ๋ฅด๊ฒ ์ค์ ๋์ง ์์์ต๋๋ค.
-
์ธ๋ถ ์ฌํญ:
- FlowFile ํฌ๊ธฐ๊ฐ ๋ชจ๋ 8๋ฐ์ดํธ๋ก, ์ค์ JSON ๋ฐ์ดํฐ๊ฐ ์๋ ๊ฐ๋ฅ์ฑ์ด ๋์์ต๋๋ค.
- EvaluateJsonPath ํ๋ก์ธ์์์ id๋ฅผ ์ถ์ถํ๊ณ ์์์ง๋ง, ์ด ๊ฐ์ด PutElasticsearchHTTP ํ๋ก์ธ์์์ ์ธ์ํ๋ ํ์์ผ๋ก ์ค์ ๋์ง ์์์ต๋๋ค.
-
PutElasticsearchHTTP ํ๋ก์ธ์ ์ค์ ์์ PutElasticsearchHTTP ํ๋ก์ธ์์ ์ค์ ์ ๋ค์๊ณผ ๊ฐ์ด ์์ ํ์ต๋๋ค:
- Identifier Attribute: id (EvaluateJsonPath์์ ์ถ์ถํ ์์ฑ๋ช ๊ณผ ์ผ์นํ๋๋ก ์ค์ )
- Index Operation: ๊ธฐ์กด upsert ์ค์ ํ์ธ
- Type: Elasticsearch ๋ฒ์ ์ ๋ง๊ฒ ์ค์ (7.x ์ด์์ ๊ฒฝ์ฐ ๋น์๋๊ฑฐ๋ _doc ์ฌ์ฉ)
-
UpdateAttribute ํ๋ก์ธ์ ์ถ๊ฐ EvaluateJsonPath์ PutElasticsearchHTTP ์ฌ์ด์ UpdateAttribute ํ๋ก์ธ์๋ฅผ ์ถ๊ฐํ์ฌ ID ๊ฐ์ ๋ช ์์ ์ผ๋ก ์ค์ ํ์ต๋๋ค:
-
์ ํ๋ก์ธ์: UpdateAttribute
-
์์ฑ ์ค์ :
- ์์ฑ๋ช : elasticsearch.id
- ๊ฐ: ${id} (EvaluateJsonPath์์ ์ถ์ถํ id ์์ฑ ์ฌ์ฉ)
-
-
PutElasticsearchHTTP ํ๋ก์ธ์ ์ค์ ์ถ๊ฐ ์์ UpdateAttribute ์ถ๊ฐ ํ, PutElasticsearchHTTP ํ๋ก์ธ์์ ์ค์ ์ ๋ค์ ์์ ํ์ต๋๋ค:
- Identifier Attribute: elasticsearch.id (UpdateAttribute์์ ์ค์ ํ ์์ฑ๋ช ์ผ๋ก ๋ณ๊ฒฝ)
-
๋๋ฒ๊น ์ ์ํ LogAttribute ํ์ฉ ๋ฌธ์ ํด๊ฒฐ ๊ณผ์ ์์ ๋ค์๊ณผ ๊ฐ์ด LogAttribute ํ๋ก์ธ์๋ฅผ ํ์ฉํ์ฌ ๋ฐ์ดํฐ ํ๋ฆ์ ๋ชจ๋ํฐ๋งํ์ต๋๋ค:
-
LogAttribute ์ค์ :
- Log Level: INFO ๋๋ DEBUG
- Log Payload: true (FlowFile ๋ด์ฉ์ ํจ๊ป ๋ก๊น )
- Attributes to Log: all attributes (๋ชจ๋ ์์ฑ ๋ก๊น )
-
Elasticsearch์ ์ ์์ ์ผ๋ก ์ ์ฅ๋ ๊ฒ์ ํ์ธํ ์ ์์์ต๋๋ค.
app-py3.12 {seilylook} ๐ curl -X GET "localhost:9200/_cat/indices?v"
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open .geoip_databases qhv1z7QnTAit-MJzq4nVxw 1 0 39 0 36.7mb 36.7mb
green open .kibana_task_manager_7.17.28_001 PkOLULcPR02nuAckO0GZQg 1 0 17 228 141kb 141kb
green open .apm-custom-link bCszQPeqTm6YGadNEOdDbw 1 0 0 0 227b 227b
green open .kibana_7.17.28_001 frGA6utWSFSA-o3uovgPBw 1 0 11 0 2.3mb 2.3mb
yellow open scf L7BtsSUNQXK1FxOEmM_tPA 1 1 5000 0 1.7mb 1.7mb
green open .apm-agent-configuration EKRPRn9aRbepTGm81kxlXQ 1 0 0 0 227b 227b
green open users hFnaau-_TtaDclMcC5Ocpw 1 0 0 0 227b 227bIn stream processing, the data may be inifinite and incomplete at the time of a query. One of the leading tools in handling streaming data is Apache Kafka. Kafka is a tool that allows you to send dat in real time to topics. These topics can be read by consumers who process the data.
services:
# Zookeeper service
zookeeper:
image: confluentinc/cp-zookeeper:7.4.0
container_name: zookeeper
ports:
- "2181:2181"
environment:
ZOOKEEPER_CLIENT_PORT: 2181
ZOOKEEPER_TICK_TIME: 2000
volumes:
- zookeeper-data:/var/lib/zookeeper/data
- zookeeper-log:/var/lib/zookeeper/log
- ./logs/zookeeper:/var/log/zookeeper
healthcheck:
test: ["CMD", "nc", "-z", "localhost", "2181"]
interval: 30s
timeout: 10s
retries: 5
start_period: 30s
networks:
- data-platform
# Kafka Broker 1
kafka1:
image: confluentinc/cp-kafka:7.4.0
container_name: kafka1
ports:
- "9092:9092"
- "29092:29092"
depends_on:
zookeeper:
condition: service_healthy
environment:
KAFKA_BROKER_ID: 1
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka1:9092,PLAINTEXT_HOST://localhost:29092
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
KAFKA_DEFAULT_REPLICATION_FACTOR: 3
KAFKA_MIN_INSYNC_REPLICAS: 2
KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 3
KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 2
KAFKA_LOG4J_LOGGERS: "kafka.controller=INFO,kafka.producer.async.DefaultEventHandler=INFO,state.change.logger=INFO"
KAFKA_LOG4J_ROOT_LOGLEVEL: INFO
KAFKA_TOOLS_LOG4J_LOGLEVEL: INFO
volumes:
- kafka1-data:/var/lib/kafka/data
- ./logs/kafka1:/var/log/kafka
healthcheck:
test: ["CMD", "kafka-topics", "--bootstrap-server", "localhost:9092", "--list"]
interval: 30s
timeout: 10s
retries: 5
start_period: 45s
networks:
- data-platform
# Kafka Broker 2
kafka2:
image: confluentinc/cp-kafka:7.4.0
container_name: kafka2
ports:
- "9093:9093"
- "29093:29093"
depends_on:
zookeeper:
condition: service_healthy
environment:
KAFKA_BROKER_ID: 2
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka2:9093,PLAINTEXT_HOST://localhost:29093
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
KAFKA_DEFAULT_REPLICATION_FACTOR: 3
KAFKA_MIN_INSYNC_REPLICAS: 2
KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 3
KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 2
KAFKA_LOG4J_LOGGERS: "kafka.controller=INFO,kafka.producer.async.DefaultEventHandler=INFO,state.change.logger=INFO"
KAFKA_LOG4J_ROOT_LOGLEVEL: INFO
KAFKA_TOOLS_LOG4J_LOGLEVEL: INFO
volumes:
- kafka2-data:/var/lib/kafka/data
- ./logs/kafka2:/var/log/kafka
healthcheck:
test: ["CMD", "kafka-topics", "--bootstrap-server", "localhost:9093", "--list"]
interval: 30s
timeout: 10s
retries: 5
start_period: 45s
networks:
- data-platform
# Kafka Broker 3
kafka3:
image: confluentinc/cp-kafka:7.4.0
container_name: kafka3
ports:
- "9094:9094"
- "29094:29094"
depends_on:
zookeeper:
condition: service_healthy
environment:
KAFKA_BROKER_ID: 3
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka3:9094,PLAINTEXT_HOST://localhost:29094
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
KAFKA_DEFAULT_REPLICATION_FACTOR: 3
KAFKA_MIN_INSYNC_REPLICAS: 2
KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 3
KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 2
KAFKA_LOG4J_LOGGERS: "kafka.controller=INFO,kafka.producer.async.DefaultEventHandler=INFO,state.change.logger=INFO"
KAFKA_LOG4J_ROOT_LOGLEVEL: INFO
KAFKA_TOOLS_LOG4J_LOGLEVEL: INFO
volumes:
- kafka3-data:/var/lib/kafka/data
- ./logs/kafka3:/var/log/kafka
healthcheck:
test: ["CMD", "kafka-topics", "--bootstrap-server", "localhost:9094", "--list"]
interval: 30s
timeout: 10s
retries: 5
start_period: 45s
networks:
- data-platform
# Kafka-UI
kafka-ui:
image: provectuslabs/kafka-ui:v0.7.2
container_name: kafka-ui
ports:
- "8989:8080"
depends_on:
- kafka1
- kafka2
- kafka3
environment:
KAFKA_CLUSTERS_0_NAME: data-platform-cluster
KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka1:9092,kafka2:9093,kafka3:9094
KAFKA_CLUSTERS_0_ZOOKEEPER: zookeeper:2181
KAFKA_CLUSTERS_0_METRICS_PORT: 9997
restart: always
networks:
- data-platform- Topics ํ ์คํธ
docker exec -i -t kafka1 bash
kafka-topics --bootstrap-server kafka1:9092,kafka2:9093,kafka3:9094 --create --topic {TOPIC_NAME} --partitions 3 --replication-factor 3
Created topic dataengineering.- Topics ํ์ธ
kafka-topics --bootstrap-server kafka1:9092 --list
dataengineering
kafka-topics --bootstrap-server kafka1:9092 --describe --topic dataengineering
Topic: dataengineering TopicId: FhQDsGqqQVaJARfp--T_tw PartitionCount: 3 ReplicationFactor: 3 Configs: min.insync.replicas=2
Topic: dataengineering Partition: 0 Leader: 3 Replicas: 3,2,1 Isr: 3,2,1
Topic: dataengineering Partition: 1 Leader: 1 Replicas: 1,3,2 Isr: 1,3,2
Topic: dataengineering Partition: 2 Leader: 2 Replicas: 2,1,3 Isr: 2,1,3- Messages ํ ์คํธ
kafka-console-producer --bootstrap-server kafka1:9092 kafka2:9093 kafka3:9094 --topic dataengineering
> ์๋
ํ์ธ์ ๋ฉ์์ง 1์
๋๋ค.
> ์๋
ํ์ธ์ ๋ฉ์์ง 2์
๋๋ค.
> {"name": "ํ
์คํธ", "value": 123}- Read Message
์๋ก์ด ํฐ๋ฏธ๋ ์ด๊ธฐ
kafka-console-consumer --bootstrap-server kafka1:9092,kafka2:9093,kafka3:9094 --topic dataengineering --from-beginning
์๋
ํ์ธ์ ๋ฉ์๏ฟฝ์ง 1์
๋๋ค.
์๋
ํ์ธ์ ๋ฉ์์ง 2์
๋๋ค.
{"๏ฟฝname": "ํ
์คํธ", "value": 123}
Processed a total of 4 messagesapp-py3.12 {seilylook} ๐ก ๎ฐ ~/Development/Book/Data_Engineering_with_Python/app ๎ฐ ๎ main ยฑ ๎ฐ make start
=========================
Starting the application...
=========================
python -m src.main
2025-03-20 21:36:56,208 - root - INFO - ๋ฐ์ดํฐ์
์ด ์ด๋ฏธ ์กด์ฌํฉ๋๋ค: data/raw/test_data.csv
2025-03-20 21:36:56,208 - root - INFO - ========================
2025-03-20 21:36:56,208 - root - INFO - Kafka Topic & Message ์์ฑ
2025-03-20 21:36:56,208 - root - INFO - ========================
2025-03-20 21:36:56,227 - src.services.data_streaming - INFO - Kafka ํด๋ฌ์คํฐ ์ฐ๊ฒฐ ๋ฐ ํ ํฝ ํ์ธ ์ค...
2025-03-20 21:36:56,255 - src.services.data_streaming - WARNING - 'users' ํ ํฝ์ด ์กด์ฌํ์ง ์์ต๋๋ค. ์๋ ์์ฑ๋ ์ ์์ต๋๋ค.
2025-03-20 21:36:56,255 - src.services.data_streaming - INFO - 'data/raw/test_data.csv' ํ์ผ ์ฒ๋ฆฌ ์์
2025-03-20 21:36:56,261 - src.services.data_streaming - INFO - 100๊ฐ ๋ฉ์์ง ์ฒ๋ฆฌ ์ค...
2025-03-20 21:36:56,264 - src.services.data_streaming - INFO - 200๊ฐ ๋ฉ์์ง ์ฒ๋ฆฌ ์ค...
2025-03-20 21:36:56,267 - src.services.data_streaming - INFO - 300๊ฐ ๋ฉ์์ง ์ฒ๋ฆฌ ์ค...
2025-03-20 21:36:56,270 - src.services.data_streaming - INFO - 400๊ฐ ๋ฉ์์ง ์ฒ๋ฆฌ ์ค...
2025-03-20 21:36:56,272 - src.services.data_streaming - INFO - 500๊ฐ ๋ฉ์์ง ์ฒ๋ฆฌ ์ค...
2025-03-20 21:36:56,275 - src.services.data_streaming - INFO - 600๊ฐ ๋ฉ์์ง ์ฒ๋ฆฌ ์ค...
2025-03-20 21:36:56,277 - src.services.data_streaming - INFO - 700๊ฐ ๋ฉ์์ง ์ฒ๋ฆฌ ์ค...
2025-03-20 21:36:56,280 - src.services.data_streaming - INFO - 800๊ฐ ๋ฉ์์ง ์ฒ๋ฆฌ ์ค...
2025-03-20 21:36:56,282 - src.services.data_streaming - INFO - 900๊ฐ ๋ฉ์์ง ์ฒ๋ฆฌ ์ค...
2025-03-20 21:36:56,285 - src.services.data_streaming - INFO - 1000๊ฐ ๋ฉ์์ง ์ฒ๋ฆฌ ์ค...
2025-03-20 21:36:58,524 - src.services.data_streaming - INFO - ์ด 1000๊ฐ ๋ฉ์์ง๊ฐ 'users' ํ ํฝ์ผ๋ก ์ ์ก๋์์ต๋๋ค.- DNS ํด์ ์คํจ
%3|1742473184.651|FAIL|csv-producer#producer-1| [thrd:kafka2:9093/bootstrap]: kafka2:9093/bootstrap: Failed to resolve 'kafka2:9093': nodename nor servname provided, or not known (after 3ms in state CONNECT, 1 identical error(s) suppressed)- ์ฐ๊ฒฐ ์๊ฐ ์ด๊ณผ
%4|1742473953.566|FAIL|csv-producer#producer-1| [thrd:172.18.0.7:9093/bootstrap]: 172.18.0.7:9093/bootstrap: Connection setup timed out in state CONNECT (after 30030ms in state CONNECT)
%4|1742473954.564|FAIL|csv-producer#producer-1| [thrd:172.18.0.6:9092/bootstrap]: 172.18.0.6:9092/bootstrap: Connection setup timed out in state CONNECT (after 30028ms in state CONNECT)
%4|1742473955.569|FAIL|csv-producer#producer-1| [thrd:172.18.0.8:9094/bootstrap]: 172.18.0.8:9094/bootstrap: Connection setup timed out in state CONNECT (after 30029ms in state CONNECT)- kafka ์ด์ค ๋ฆฌ์ค๋ ์ค์
kafka๋ Docker ์ปจํ ์ด๋ ํ๊ฒฝ์์ ๋ ๊ฐ์ง ๋ฆฌ์ค๋๋ฅผ ์ฌ์ฉํ๋ค.
-
๋ด๋ถ ํต์ ์ฉ ๋ฆฌ์ค๋:
PLAINTEXT://kafka1:9092-
์ปจํฐ์๋ ๊ฐ ๋ด๋ถ ํต์ ์ ์ฌ์ฉ๋จ
-
Docker ๋ด๋ถ DNS๋ก ํด์๋์ด์ผ ํจ
-
-
์ธ๋ถ ์ ๊ทผ์ฉ ๋ฆฌ์ค๋:
PLAINTEXT_HOST://localhost:29092-
ํธ์คํธ ๋จธ์ ์์ ์ ๊ทผํ ๋ ์ฌ์ฉ๋จ
-
์ธ๋ถ๋ก ํฌํธ๊ฐ ๋ ธ์ถ๋จ
-
- ๊ตฌ์ฒด์ ์ธ ์ค๋ฅ ์์ธ
-
DNS ํด์ ์คํจ
-
ํด๋ผ์ด์ธํธ๊ฐ
kafka1:9092, kafka2:9093, kafka3:9094์ ๊ฐ์ ํธ์คํธ๋ช ์ IP์ฃผ์๋ก ํด์ํ์ง ๋ชปํจ -
์ด๋ Docker ๋คํธ์ํฌ ์ธ๋ถ์์ ์ ๊ทผํ๊ฑฐ๋, DNS ์ค์ ์ด ์ ๋๋ก ๋์ง ์์ ๊ฒฝ์ฐ ๋ฐ์.
-
-
์ฐ๊ฒฐ ์๊ฐ ์ด๊ณผ
-
IP ์ฃผ์๋ ํด์๋์์ผ๋ ์ค์ TCP ์ฐ๊ฒฐ์ด ์ด๋ฃจ์ด์ง์ง ์์
-
์ด๋ ๋ณดํต ๋ฐฉํ๋ฒฝ ๋ฌธ์ , ๋คํธ์ํฌ ๋ถ๋ฆฌ, ๋๋ Kafka ์ค์ ๋ฌธ์ ๋ก ๋ฐ์
-
- ์ ๊ทผ ๋ฐฉ์ ๋ณ๊ฒฝ: ๋ด๋ถ ํฌํธ์์ ์ธ๋ถ ํฌํธ๋ก
kafka ๋ธ๋ก์ปค์ ์ธ๋ถ ๋ ธ์ถ ํฌํธ(29092, 29093, 29094)๋ฅผ ์ฌ์ฉํ๋๋ก ๋ณ๊ฒฝ
# ๋ณ๊ฒฝ ์
bootstrap_servers = "kafka1:9092,kafka2:9093,kafka3:9094"
# ๋ณ๊ฒฝ ํ
bootstrap_servers = "localhost:29092,localhost:29093,localhost:29094"- ํด๊ฒฐ ์๋ฆฌ
-
๋ด๋ถ ํฌํธ(9092, 9093, 9094):
-
Docker ๋คํธ์ํฌ ๋ด์์๋ง ์ ๊ทผ ๊ฐ๋ฅ
-
์ปจํ ์ด๋ ๊ฐ ์ง์ ํต์ ์ ์ฌ์ฉ
-
-
์ธ๋ถ ํฌํธ(29092, 29093, 29094):
-
ํธ์คํธ ๋จธ์ ์ ํตํด ์ ๊ทผ
-
Docker ์ปจํ ์ด๋ ์ธ๋ถ์์๋ ์ ๊ทผ ๊ฐ๋ฅ
-
localhost๋ก ๋ผ์ฐํ ๋จ
-
- Docker compose์์์ kafka ์ค์ ํ์ธ
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka1:9092,PLAINTEXT_HOST://localhost:29092
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXTADVERTISED_LISTENERS: Kafka๊ฐ ํด๋ผ์ด์ธํธ์๊ฒ ์๋ ค์ฃผ๋ ์ฐ๊ฒฐ ์ ๋ณดLISTENER_SECURITY_PROTOCOL_MAP: ๊ฐ ๋ฆฌ์ค๋์ ๋ณด์ ํ๋กํ ์ฝ ์ง์ INTER_BROKER_LISTENER_NAME: ๋ธ๋ก์ปค ๊ฐ ํต์ ์ ์ฌ์ฉํ ๋ฆฌ์ค๋ ์ง์
-
๊ฐ์ Docker ๋คํธ์ํฌ ๋ด: ์๋น์ค ์ด๋ฆ๊ณผ ๋ด๋ถ ํฌํธ (kafka1:9092)
-
์ธ๋ถ ๋๋ ๋ค๋ฅธ ๋คํธ์ํฌ: localhost์ ์ธ๋ถ ํฌํธ (localhost:29092)
app-py3.12 โ {seilylook} ๐ก ๎ฐ ~/Development/Book/Data_Engineering_with_Python/app ๎ฐ ๎ main ยฑ ๎ฐ make start
=========================
Starting the application...
=========================
python -m src.main
2025-03-20 23:02:45,286 - root - INFO - ๋ฐ์ดํฐ์
์ด ์ด๋ฏธ ์กด์ฌํฉ๋๋ค: data/raw/test_data.csv
2025-03-20 23:02:45,286 - root - INFO - ========================
2025-03-20 23:02:45,286 - root - INFO - Kafka Topic & Message ์์ฑ
2025-03-20 23:02:45,286 - root - INFO - ========================
2025-03-20 23:02:45,303 - src.services.data_streaming - INFO - ํ ํฝ ๊ตฌ๋
์์: users
2025-03-20 23:02:45,303 - src.services.data_streaming - INFO - ๋ฉ์์ง ์๋น ์์...
2025-03-20 23:02:49,507 - root - INFO - Received: {'name': 'Kristina Parker', 'age': 68, 'street': '34674 Miller Overpass', 'city': 'Randallfurt', 'state': 'Maryland', 'zip': 40293, 'lng': 161.665903, 'lat': -87.125185}
2025-03-20 23:02:49,507 - root - INFO - Received: {'name': 'Johnathan Lawson', 'age': 19, 'street': '95990 Williams Shore Apt. 829', 'city': 'Webbside', 'state': 'Maine', 'zip': 15543, 'lng': 146.494403, 'lat': -73.700935}
2025-03-20 23:02:49,507 - root - INFO - Received: {'name': 'Rose Carpenter', 'age': 68, 'street': '444 Joseph Station', 'city': 'Pattersonside', 'state': 'New Mexico', 'zip': 79242, 'lng': 0.048327, 'lat': 74.385104}
2025-03-20 23:02:49,507 - root - INFO - Received: {'name': 'Kimberly Santiago', 'age': 39, 'street': '7635 Peterson Spur Apt. 396', 'city': 'Tinaborough', 'state': 'Nevada', 'zip': 66267, 'lng': -38.278099, 'lat': -36.354147}
2025-03-20 23:02:49,507 - root - INFO - Received: {'name': 'Wendy Murphy', 'age': 75, 'street': '35166 Ashlee Mills', 'city': 'Lawsonview', 'state': 'Massachusetts', 'zip': 30520, 'lng': -137.345477, 'lat': 35.262674}
2025-03-20 23:02:49,507 - root - INFO - Received: {'name': 'Michael Lin', 'age': 18, 'street': '13086 Hall Pass', 'city': 'East Jay', 'state': 'New York', 'zip': 49686, 'lng': -52.411619, 'lat': -5.883704}
2025-03-20 23:02:49,507 - root - INFO - Received: {'name': 'Wesley Watts', 'age': 61, 'street': '4541 Roth Brook Apt. 538', 'city': 'Hensleyland', 'state': 'Maine', 'zip': 70629, 'lng': 137.051209, 'lat': -35.1061065}
2025-03-20 23:02:49,507 - root - INFO - Received: {'name': 'Dennis Wolfe', 'age': 37, 'street': '474 Jones Plaza', 'city': 'Wardville', 'state': 'Minnesota', 'zip': 70795, 'lng': 19.632934, 'lat': -81.602252}
2025-03-20 23:02:49,508 - root - INFO - Received: {'name': 'Sharon Chandler', 'age': 21, 'street': '696 Michael Valleys Apt. 412', 'city': 'Lauraton', 'state': 'New Jersey', 'zip': 19419, 'lng': 14.510882, 'lat': 65.1203075}
2025-03-20 23:02:49,508 - root - INFO - Received: {'name': 'Amanda Mcmahon', 'age': 34, 'street': '96470 Cobb Hollow', 'city': 'Albertberg', 'state': 'Louisiana', 'zip': 22483, 'lng': -8.723311, 'lat': 27.196991}
2025-03-20 23:02:49,508 - root - INFO - Received: {'name': 'Peter Nguyen', 'age': 68, 'street': '15478 Dylan Crescent', 'city': 'North Katrinashire', 'state': 'New Jersey', 'zip': 96223, 'lng': 26.947073, 'lat': -9.097944}
2025-03-20 23:02:49,508 - root - INFO - Received: {'name': 'Matthew Robbins', 'age': 43, 'street': '4211 Brittany Field Suite 605', 'city': 'South Rebeccaborough', 'state': 'Delaware', 'zip': 19879, 'lng': 100.065663, 'lat': 54.933101}
2025-03-20 23:02:49,508 - root - INFO - Received: {'name': 'Michael Wilcox', 'age': 33, 'street': '018 Leon Alley', 'city': 'Johnmouth', 'state': 'New Mexico', 'zip': 73338, 'lng': -19.245506, 'lat': 26.5704125}
2025-03-20 23:02:49,508 - root - INFO - Received: {'name': 'Amanda Williams', 'age': 75, 'street': '44981 Rebecca Bypass', 'city': 'North Joseph', 'state': 'South Carolina', 'zip': 66529, 'lng': -24.771468, 'lat': 14.545032}













