App instrumentation is being set up for the following 3 applications:
| # | App Name | Region | Crawl Start Time | Status |
|---|---|---|---|---|
| 1 | Woolworths_app-AU | 🇦🇺 Australia | 4:30 AM | Active |
| 2 | Tokopedia_app-ID | 🇮🇩 Indonesia | 8:30 AM | Active |
| 3 | Talabat_app-AE | 🇦🇪 UAE | TBD | Active |
The instrumentation pipeline consists of five sequential stages:

1. Crawl: the app instrumentation crawl job visits each app's pages and collects raw XML dumps and URL screenshots.
2. Seed ingestion: the seed file is parsed to obtain the meta_info and seed_url for each app.
3. Extraction: extractor.py fetches regex configs from the CMv3 API and applies them to the XML dumps to extract product data.
4. Workspace push: the extracted JSON files are copied into today's dated folder in the workspace.
5. Kafka push: a crontab job publishes the staged JSON files to the app's Kafka topic.

The following sections describe the exact steps to run the instrumentation crawl for each app. Each app requires two terminals: one for Appium and one for running the Python scripts.
The Python virtual environment used by the crawl scripts lives at App_Instrumentation/App_env/.
Woolworths_app-AU
Australia · Grocery & Supermarket App
Terminal 1 Start Appium server
Open a terminal and launch the Appium server. Keep this terminal running throughout the entire session.
appium
Terminal 2 Activate Python environment
Open a second terminal and navigate to the project directory, then activate the virtual environment.
cd App_Instrumentation
source App_env/bin/activate
Run search crawl ~1 hour 30 min
In Terminal 2, run the search script. This crawls search result pages and collects XML dumps and screenshots.
python woolworth_app_search.py
Wait for this to complete fully before proceeding to the next step.
Run listing crawl ~30 min
Once the search crawl has completed, run the listing script to collect product detail page data.
python woolworth_app_listing.py
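Both crawl scripts depend on the Appium server from Terminal 1 being up. A quick preflight check can catch a missing server before a long crawl starts. The sketch below assumes Appium 2.x's default port 4723 and its /status endpoint; adjust host and port if your server is configured differently:

```python
import json
import urllib.error
import urllib.request


def appium_status_url(host: str = "127.0.0.1", port: int = 4723) -> str:
    """Build the Appium /status endpoint URL (Appium 2.x default layout)."""
    return f"http://{host}:{port}/status"


def appium_is_ready(host: str = "127.0.0.1", port: int = 4723,
                    timeout: float = 3.0) -> bool:
    """Return True if the Appium server answers its /status endpoint."""
    try:
        with urllib.request.urlopen(appium_status_url(host, port),
                                    timeout=timeout) as resp:
            body = json.load(resp)
        # Appium reports {"value": {"ready": true, ...}} when it is up.
        return bool(body.get("value", {}).get("ready", False))
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

Calling appium_is_ready() at the top of each crawl script (and exiting early with a clear message when it returns False) avoids discovering a dead Appium session an hour into the search crawl.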
Tokopedia_app-ID
Indonesia · E-commerce App
Terminal 1 Start Appium server
Open a terminal and launch the Appium server. Keep this terminal running throughout the entire session.
appium
Terminal 2 Activate Python environment
Open a second terminal and navigate to the project directory, then activate the virtual environment.
cd App_Instrumentation
source App_env/bin/activate
Run search crawl ~2 hours
In Terminal 2, run the search script with the Tokopedia seed file as the argument.
python tokopedia_app_search.py tokopedia_search_seed.txt
Wait for this to complete fully before proceeding to the next step.
Run listing crawl ~30 min
Once the search crawl has completed, run the listing script.
python tokopedia_app_listing.py
Talabat_app-AE
UAE · Food Delivery App
Terminal 1 Start Appium server
Open a terminal and launch the Appium server. Keep this terminal running throughout the entire session.
appium
Terminal 2 Activate Python environment
Open a second terminal and navigate to the project directory, then activate the virtual environment.
cd App_Instrumentation
source App_env/bin/activate
Run search crawl ~1 hour 30 min
In Terminal 2, run the search script with the Talabat search seed file as the argument.
python talabat_app_search.py talabat_search_seed_file.txt
Wait for this to complete fully before proceeding to the next step.
Run listing crawl ~30 min
Once the search crawl has completed, run the listing script with the Talabat listing seed file.
python talabat_app_listing.py talabat_listing_seed_file.txt
Crawl jobs are run for each of the 3 apps to instrument their pages and collect raw data.
What is collected:
- XML dumps of the crawled pages
- URL screenshots
Output location: [path/to/crawl/output/]
The crawl is configured per app using the seed file (see Step 2). The frequency and scope of crawls can be adjusted based on the app's update cycle.
Each app has a corresponding seed file that drives the crawl and extraction process. The seed file contains two key fields:
| Field | Description | Example |
|---|---|---|
| seed_url | The starting URL(s) for the crawl. The crawler begins traversal from these URLs. | https://app.example.com/category/shoes |
| meta_info | Metadata associated with the seed URL, such as category name, keyword, app identifier, and any configuration flags. | {"category": "shoes", "keyword": "running shoes"} |
The seed file is loaded at the start of both the crawl and the extraction phases. It acts as the source of truth for what data to collect and how to label it.
Seed file location: [path/to/seed_file]
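The exact on-disk format of the seed file is not documented here; as an illustration, the loader below assumes a JSON-lines layout where each line carries the two fields described above. Treat the format, the comment convention, and the function name as assumptions to adapt to the real files:

```python
import json
from typing import Iterator


def load_seed_file(path: str) -> Iterator[dict]:
    """Yield one seed record per line of a JSON-lines seed file.

    Assumed line shape: {"seed_url": "...", "meta_info": {...}}.
    Blank lines and lines starting with '#' are skipped.
    """
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            record = json.loads(line)
            # Both fields drive downstream stages, so fail loudly if absent.
            if "seed_url" not in record or "meta_info" not in record:
                raise ValueError(f"line {line_no}: missing seed_url or meta_info")
            yield record
```

Validating both fields at load time keeps the crawl and extraction phases from silently running with unlabeled or unseeded records.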
This step transforms the raw XML dumps into structured product data using the CMv3 extraction framework. The process runs locally and involves the following sub-steps:
1. extractor.py makes an API call to the CMv3 service to retrieve the regex patterns associated with the app. These patterns define how product fields (title, price, URL, etc.) are identified within the XML.
2. The regex patterns are applied to the XML dump collected for each seed_url. This parses out the relevant product attributes from the raw markup.
3. The extracted records are labeled with the context from the seed file's meta_info.

| Attribute | Detail |
|---|---|
| File path | crawler/extractor.py |
| Role | Orchestrates the extraction: calls CMv3 for config, applies regex, writes JSON |
| Input | XML dump file path, seed file path |
| Output | JSON file with product data per keyword/category |
| API dependency | CMv3 API (regex config endpoint) |
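The regex-application step can be sketched roughly as follows. The config shape (one pattern with a capture group per field) and the position-based zipping of matches are simplifying assumptions, not the real CMv3 contract:

```python
import re


def extract_products(xml_text: str, regex_config: dict) -> list:
    """Apply per-field regex patterns (as fetched from CMv3) to one XML dump.

    regex_config is assumed to map field name -> pattern with exactly one
    capture group, e.g. {"title": r'text="([^"]+)"', ...}.
    """
    compiled = {field: re.compile(pat) for field, pat in regex_config.items()}
    # Collect all matches per field, then zip them by position. This assumes
    # the fields appear once per product block and in document order.
    per_field = {field: rx.findall(xml_text) for field, rx in compiled.items()}
    count = min((len(v) for v in per_field.values()), default=0)
    return [
        {**{field: per_field[field][i] for field in per_field}, "rank": i + 1}
        for i in range(count)
    ]
```

The real extractor.py presumably also handles products whose fields are missing or out of order; the sketch only shows the happy path of regex-driven field extraction.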
{
"keyword": "running shoes",
"category": "shoes",
"app": "App 1",
"products": [
{
"title": "Nike Air Zoom",
"price": "₹5,499",
"url": "https://app.example.com/product/nike-air-zoom",
"rank": 1
},
...
]
}
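Before a file moves downstream it is worth checking it against the expected shape above. A minimal validator, with the field names taken from the sample output (treat them as assumptions until confirmed against the real schema):

```python
REQUIRED_TOP = {"keyword", "category", "app", "products"}
REQUIRED_PRODUCT = {"title", "price", "url", "rank"}


def validate_output(doc: dict) -> list:
    """Return a list of human-readable problems; empty list means the doc looks OK."""
    problems = [f"missing top-level field: {f}"
                for f in sorted(REQUIRED_TOP - doc.keys())]
    for i, product in enumerate(doc.get("products", [])):
        for f in sorted(REQUIRED_PRODUCT - product.keys()):
            problems.append(f"products[{i}] missing field: {f}")
    return problems
```

Running this check in extractor.py right before writing the JSON keeps malformed files out of the workspace and the Kafka pipeline.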
Once extraction is complete, the output JSON file is pushed to the workspace. The workspace acts as a staging area where files are organized by date before being consumed by the Kafka pipeline.
Directory structure in workspace:
workspace/
└── data/
    └── YYYY-MM-DD/          ← today's data folder
        ├── app1_keyword1.json
        ├── app1_keyword2.json
        ├── app2_keyword1.json
        └── ...
Files are named to reflect the app and keyword/category for easy identification and traceability.
A crontab job is configured in the workspace environment to periodically run commands that push data to Kafka. On each run, the job scans the workspace/data/YYYY-MM-DD/ folder for new or unprocessed JSON files and pushes them.

# Run every 15 minutes: push new JSON files to Kafka
*/15 * * * * /path/to/scripts/kafka_push.sh /workspace/data/$(date +\%Y-\%m-\%d)/
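One way kafka_push.sh can stay idempotent across 15-minute runs is a marker-file convention: a sibling .pushed file records that a JSON file was already published. The Python sketch below illustrates that logic with the actual Kafka producer call injected as a callback; this is an assumption about the script's behavior, not its documented implementation:

```python
from pathlib import Path
from typing import Callable


def push_new_files(day_dir: Path, push: Callable[[Path], None]) -> list:
    """Push each unprocessed .json file in today's folder exactly once.

    A sibling "<name>.json.pushed" marker records completion, so a file
    that was already pushed by an earlier cron run is skipped.
    """
    pushed = []
    for json_file in sorted(day_dir.glob("*.json")):
        marker = json_file.with_suffix(".json.pushed")
        if marker.exists():
            continue               # already handled by an earlier run
        push(json_file)            # e.g. produce to the app's Kafka topic
        marker.touch()             # mark done only after a successful push
        pushed.append(json_file.name)
    return pushed
```

Because the marker is written only after push() returns, a failed push leaves the file unmarked and the next cron run retries it.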
| App | Kafka Topic | Notes |
|---|---|---|
| App 1 | [topic_name] | Add details |
| App 2 | [topic_name] | Add details |
| App 3 | [topic_name] | Add details |
| Stage | Input | Process | Output |
|---|---|---|---|
| Crawl | seed_url | App instrumentation crawl job | XML dump, URL screenshots |
| Seed ingestion | Seed file | Parse meta_info + seed_url | Crawl config + extraction context |
| Extraction | XML dump + CMv3 regex config | extractor.py applies regex | JSON file (products per keyword) |
| Workspace push | JSON file | Copy to today's dated folder | File in workspace/data/YYYY-MM-DD/ |
| Kafka push | JSON file in today's folder | Crontab triggers push command | Data published to Kafka topic |
| Component | Description | Owner / Docs |
|---|---|---|
| CMv3 API | Provides regex extraction configuration per app | [Link to CMv3 docs] |
| crawler/extractor.py | Core extraction script: reads XML, applies regex, writes JSON | [Link to repo] |
| Seed file | Defines crawl scope (seed_url + meta_info) per app | [Link to seed file location] |
| Workspace | Shared storage where JSON files are staged per date | [Link to workspace] |
| Kafka | Message queue receiving the extracted product data | [Link to Kafka setup] |
| Crontab | Scheduler for automated Kafka push commands | [Server / environment] |
| # | Item | Owner | Status |
|---|---|---|---|
| 1 | Fill in app names and descriptions in the Scope table | | Pending |
| 2 | Confirm crontab schedule and add exact cron expression | | Pending |
| 3 | Add Kafka topic names per app | | Pending |
| 4 | Attach architecture diagram to this page | | Pending |
| 5 | Document error handling and retry logic in extractor.py | | Pending |
Page maintained by the Data Engineering team. For questions, reach out via [Slack channel].