App instrumentation is being set up for the following 3 applications:
| # | App Name | Region | Crawl Start Time | Status |
|---|---|---|---|---|
| 1 | Woolworths_app-AU | 🇦🇺 Australia | 4:30 AM | Active |
| 2 | Tokopedia_app-ID | 🇮🇩 Indonesia | 8:30 AM | Active |
| 3 | Talabat_app-AE | 🇦🇪 UAE | TBD | Active |
The instrumentation pipeline consists of five sequential stages:

1. Crawl: the app instrumentation crawl job visits each app's pages and collects raw XML dumps and URL screenshots.
2. Seed ingestion: the seed file is parsed to obtain the meta_info and seed_url for each app.
3. Extraction: extractor.py fetches regex configs from the CMv3 API and applies them to the XML dumps to extract product data.
4. Workspace push: the extracted JSON files are copied into today's dated folder in the workspace.
5. Kafka push: a crontab job publishes the staged JSON files to the app's Kafka topic.

The following sections describe the exact steps to run the instrumentation crawl for each app. Each app requires two terminals: one for Appium and one for running the Python scripts.
The Python virtual environment used by the crawl scripts lives at App_Instrumentation/App_env/.
Woolworths_app-AU
Australia · Grocery & Supermarket App
Terminal 1 Start Appium server
Open a terminal and launch the Appium server. Keep this terminal running throughout the entire session.
appium
Terminal 2 Activate Python environment
Open a second terminal and navigate to the project directory, then activate the virtual environment.
cd App_Instrumentation
source App_env/bin/activate
Run search crawl ~1 hour 30 min
In Terminal 2, run the search script. This crawls search result pages and collects XML dumps and screenshots.
python woolworth_app_search.py
Wait for this to complete fully before proceeding to the next step.
Run listing crawl ~30 min
Once the search crawl has completed, run the listing script to collect product detail page data.
python woolworth_app_listing.py
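Both crawl scripts depend on the Appium server from Terminal 1 being up. A quick preflight check can catch a missing server before a long crawl starts. The sketch below assumes Appium 2.x's default port 4723 and its /status endpoint; adjust host and port if your server is configured differently:

```python
import json
import urllib.error
import urllib.request


def appium_status_url(host: str = "127.0.0.1", port: int = 4723) -> str:
    """Build the Appium /status endpoint URL (Appium 2.x default layout)."""
    return f"http://{host}:{port}/status"


def appium_is_ready(host: str = "127.0.0.1", port: int = 4723,
                    timeout: float = 3.0) -> bool:
    """Return True if the Appium server answers its /status endpoint."""
    try:
        with urllib.request.urlopen(appium_status_url(host, port),
                                    timeout=timeout) as resp:
            body = json.load(resp)
        # Appium reports {"value": {"ready": true, ...}} when it is up.
        return bool(body.get("value", {}).get("ready", False))
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

Calling appium_is_ready() at the top of each crawl script (and exiting early with a clear message when it returns False) avoids discovering a dead Appium session an hour into the search crawl.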
Tokopedia_app-ID
Indonesia · E-commerce App
Terminal 1 Start Appium server
Open a terminal and launch the Appium server. Keep this terminal running throughout the entire session.
appium
Terminal 2 Activate Python environment
Open a second terminal and navigate to the project directory, then activate the virtual environment.
cd App_Instrumentation
source App_env/bin/activate
Run search crawl ~2 hours
In Terminal 2, run the search script with the Tokopedia seed file as the argument.
python tokopedia_app_search.py tokopedia_search_seed.txt
Wait for this to complete fully before proceeding to the next step.
Run listing crawl ~30 min
Once the search crawl has completed, run the listing script.
python tokopedia_app_listing.py
Talabat_app-AE
UAE · Food Delivery App
Terminal 1 Start Appium server
Open a terminal and launch the Appium server. Keep this terminal running throughout the entire session.
appium
Terminal 2 Activate Python environment
Open a second terminal and navigate to the project directory, then activate the virtual environment.
cd App_Instrumentation
source App_env/bin/activate
Run search crawl ~1 hour 30 min
In Terminal 2, run the search script with the Talabat search seed file as the argument.
python talabat_app_search.py talabat_search_seed_file.txt
Wait for this to complete fully before proceeding to the next step.
Run listing crawl ~30 min
Once the search crawl has completed, run the listing script with the Talabat listing seed file.
python talabat_app_listing.py talabat_listing_seed_file.txt
Crawl jobs are run for each of the 3 apps to instrument their pages and collect raw data.
What is collected:
- XML dumps of the crawled pages
- URL screenshots
Output location: [path/to/crawl/output/]
The crawl is configured per app using the seed file (see Step 2). The frequency and scope of crawls can be adjusted based on the app's update cycle.
Each app has a corresponding seed file that drives the crawl and extraction process. The seed file contains two key fields:
| Field | Description | Example |
|---|---|---|
| seed_url | The starting URL(s) for the crawl. The crawler begins traversal from these URLs. | https://app.example.com/category/shoes |
| meta_info | Metadata associated with the seed URL, such as category name, keyword, app identifier, and any configuration flags. | {"category": "shoes", "keyword": "running shoes"} |
The seed file is loaded at the start of both the crawl and the extraction phases. It acts as the source of truth for what data to collect and how to label it.
Seed file location: [path/to/seed_file]
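The exact on-disk format of the seed file is not documented here; as an illustration, the loader below assumes a JSON-lines layout where each line carries the two fields described above. Treat the format, the comment convention, and the function name as assumptions to adapt to the real files:

```python
import json
from typing import Iterator


def load_seed_file(path: str) -> Iterator[dict]:
    """Yield one seed record per line of a JSON-lines seed file.

    Assumed line shape: {"seed_url": "...", "meta_info": {...}}.
    Blank lines and lines starting with '#' are skipped.
    """
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            record = json.loads(line)
            # Both fields drive downstream stages, so fail loudly if absent.
            if "seed_url" not in record or "meta_info" not in record:
                raise ValueError(f"line {line_no}: missing seed_url or meta_info")
            yield record
```

Validating both fields at load time keeps the crawl and extraction phases from silently running with unlabeled or unseeded records.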
This step transforms the raw XML dumps into structured product data using the CMv3 extraction framework. The process runs locally and involves the following sub-steps:
1. extractor.py makes an API call to the CMv3 service to retrieve the regex patterns associated with the app. These patterns define how product fields (title, price, URL, etc.) are identified within the XML.
2. The regex patterns are applied to the XML dump collected for each seed_url. This parses out the relevant product attributes from the raw markup.
3. The extracted records are labeled with the context from the seed file's meta_info.

| Attribute | Detail |
|---|---|
| File path | crawler/extractor.py |
| Role | Orchestrates the extraction: calls CMv3 for config, applies regex, writes JSON |
| Input | XML dump file path, seed file path |
| Output | JSON file with product data per keyword/category |
| API dependency | CMv3 API (regex config endpoint) |
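The regex-application step can be sketched roughly as follows. The config shape (one pattern with a capture group per field) and the position-based zipping of matches are simplifying assumptions, not the real CMv3 contract:

```python
import re


def extract_products(xml_text: str, regex_config: dict) -> list:
    """Apply per-field regex patterns (as fetched from CMv3) to one XML dump.

    regex_config is assumed to map field name -> pattern with exactly one
    capture group, e.g. {"title": r'text="([^"]+)"', ...}.
    """
    compiled = {field: re.compile(pat) for field, pat in regex_config.items()}
    # Collect all matches per field, then zip them by position. This assumes
    # the fields appear once per product block and in document order.
    per_field = {field: rx.findall(xml_text) for field, rx in compiled.items()}
    count = min((len(v) for v in per_field.values()), default=0)
    return [
        {**{field: per_field[field][i] for field in per_field}, "rank": i + 1}
        for i in range(count)
    ]
```

The real extractor.py presumably also handles products whose fields are missing or out of order; the sketch only shows the happy path of regex-driven field extraction.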
{
"keyword": "running shoes",
"category": "shoes",
"app": "App 1",
"products": [
{
"title": "Nike Air Zoom",
"price": "₹5,499",
"url": "https://app.example.com/product/nike-air-zoom",
"rank": 1
},
...
]
}
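Before a file moves downstream it is worth checking it against the expected shape above. A minimal validator, with the field names taken from the sample output (treat them as assumptions until confirmed against the real schema):

```python
REQUIRED_TOP = {"keyword", "category", "app", "products"}
REQUIRED_PRODUCT = {"title", "price", "url", "rank"}


def validate_output(doc: dict) -> list:
    """Return a list of human-readable problems; empty list means the doc looks OK."""
    problems = [f"missing top-level field: {f}"
                for f in sorted(REQUIRED_TOP - doc.keys())]
    for i, product in enumerate(doc.get("products", [])):
        for f in sorted(REQUIRED_PRODUCT - product.keys()):
            problems.append(f"products[{i}] missing field: {f}")
    return problems
```

Running this check in extractor.py right before writing the JSON keeps malformed files out of the workspace and the Kafka pipeline.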
Once extraction is complete, the output JSON file is pushed to the workspace. The workspace acts as a staging area where files are organized by date before being consumed by the Kafka pipeline.
Directory structure in workspace:
workspace/
└── data/
    └── YYYY-MM-DD/          ← today's data folder
        ├── app1_keyword1.json
        ├── app1_keyword2.json
        ├── app2_keyword1.json
        └── ...
Files are named to reflect the app and keyword/category for easy identification and traceability.
A crontab job is configured in the workspace environment to periodically run commands that push data to Kafka. On each run, the job scans the workspace/data/YYYY-MM-DD/ folder for new or unprocessed JSON files and pushes them.

# Run every 15 minutes: push new JSON files to Kafka
*/15 * * * * /path/to/scripts/kafka_push.sh /workspace/data/$(date +\%Y-\%m-\%d)/
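One way kafka_push.sh can stay idempotent across 15-minute runs is a marker-file convention: a sibling .pushed file records that a JSON file was already published. The Python sketch below illustrates that logic with the actual Kafka producer call injected as a callback; this is an assumption about the script's behavior, not its documented implementation:

```python
from pathlib import Path
from typing import Callable


def push_new_files(day_dir: Path, push: Callable[[Path], None]) -> list:
    """Push each unprocessed .json file in today's folder exactly once.

    A sibling "<name>.json.pushed" marker records completion, so a file
    that was already pushed by an earlier cron run is skipped.
    """
    pushed = []
    for json_file in sorted(day_dir.glob("*.json")):
        marker = json_file.with_suffix(".json.pushed")
        if marker.exists():
            continue               # already handled by an earlier run
        push(json_file)            # e.g. produce to the app's Kafka topic
        marker.touch()             # mark done only after a successful push
        pushed.append(json_file.name)
    return pushed
```

Because the marker is written only after push() returns, a failed push leaves the file unmarked and the next cron run retries it.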
| App | Kafka Topic | Notes |
|---|---|---|
| App 1 | [topic_name] | Add details |
| App 2 | [topic_name] | Add details |
| App 3 | [topic_name] | Add details |
| Stage | Input | Process | Output |
|---|---|---|---|
| Crawl | seed_url | App instrumentation crawl job | XML dump, URL screenshots |
| Seed ingestion | Seed file | Parse meta_info + seed_url | Crawl config + extraction context |
| Extraction | XML dump + CMv3 regex config | extractor.py applies regex | JSON file (products per keyword) |
| Workspace push | JSON file | Copy to today's dated folder | File in workspace/data/YYYY-MM-DD/ |
| Kafka push | JSON file in today's folder | Crontab triggers push command | Data published to Kafka topic |
| Component | Description | Owner / Docs |
|---|---|---|
| CMv3 API | Provides regex extraction configuration per app | [Link to CMv3 docs] |
| crawler/extractor.py | Core extraction script: reads XML, applies regex, writes JSON | [Link to repo] |
| Seed file | Defines crawl scope (seed_url + meta_info) per app | [Link to seed file location] |
| Workspace | Shared storage where JSON files are staged per date | [Link to workspace] |
| Kafka | Message queue receiving the extracted product data | [Link to Kafka setup] |
| Crontab | Scheduler for automated Kafka push commands | [Server / environment] |
| # | Item | Owner | Status |
|---|---|---|---|
| 1 | Fill in app names and descriptions in the Scope table | | Pending |
| 2 | Confirm crontab schedule and add exact cron expression | | Pending |
| 3 | Add Kafka topic names per app | | Pending |
| 4 | Attach architecture diagram to this page | | Pending |
| 5 | Document error handling and retry logic in extractor.py | | Pending |
Page maintained by the Data Engineering team. For questions, reach out via [Slack channel].