App Instrumentation: Data Collection & Pipeline

Space: Data Engineering · Owner: [Your Name] · Status: In Progress · Last Updated: [Date]
Overview: This page documents the end-to-end app instrumentation process for 3 apps. It covers how crawl data is collected, how it is extracted locally using the CMv3 API, and how the resulting product data is pushed to Kafka via a scheduled crontab job.

Jump to app run steps

🇦🇺 Woolworths_app-AU: starts 4:30 AM, ~2 hrs total
🇮🇩 Tokopedia_app-ID: starts 8:30 AM, ~2.5 hrs total
🇦🇪 Talabat_app-AE: ~2 hrs total

1. Scope

App instrumentation is being set up for the following 3 applications:

| # | App Name | Region | Crawl Start Time | Status |
|---|----------|--------|------------------|--------|
| 1 | Woolworths_app-AU | 🇦🇺 Australia | 4:30 AM | Active |
| 2 | Tokopedia_app-ID | 🇮🇩 Indonesia | 8:30 AM | Active |
| 3 | Talabat_app-AE | 🇦🇪 UAE | - | Active |

2. High-Level Architecture

The instrumentation pipeline consists of five sequential stages:

  1. App Instrumentation Crawls: run crawl jobs per app to capture XML dumps and URL screenshots.
  2. Seed File Ingestion: load the seed file containing meta_info and seed_url for each app.
  3. Local Data Extraction via CMv3: use extractor.py to fetch regex configs from the CMv3 API and apply them to the XML dumps to extract product data.
  4. JSON Output → Workspace Push: the extraction output (a JSON file mapping keywords/categories to products) is pushed to the workspace.
  5. Crontab → Kafka Push: a scheduled crontab job monitors the workspace; when a new file is detected in today's data folder, the data is pushed to Kafka.
Note: Refer to the process flow diagram in the Confluence diagram macro or the attached architecture image for a visual overview of these stages.
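Read as code, the five stages chain into a simple per-seed loop. The sketch below is purely illustrative: every function name and return shape is a stand-in, since the real scripts (the crawlers, extractor.py, and the push job) live elsewhere.

```python
# Illustrative sketch of the five pipeline stages; all names are stand-ins,
# not the real scripts. Each stage is stubbed to show the data hand-off.

def run_crawl(seed_url):
    # Stage 1: instrumentation crawl -> raw XML dump (plus screenshots).
    return f"<xml dump for {seed_url}>"

def load_seed():
    # Stage 2: seed file entries, each with seed_url + meta_info.
    return [{"seed_url": "https://app.example.com/category/shoes",
             "meta_info": {"category": "shoes", "keyword": "running shoes"}}]

def extract(xml_dump, meta_info):
    # Stage 3: CMv3 regex extraction (stubbed) -> products per keyword.
    return {"keyword": meta_info["keyword"],
            "category": meta_info["category"],
            "products": []}

def push_to_workspace(record):
    # Stage 4: stage the JSON under today's dated workspace folder.
    return record

def push_to_kafka(record):
    # Stage 5: cron-driven publish; here we just mark the record.
    record["published"] = True
    return record

def run_pipeline():
    records = []
    for entry in load_seed():
        xml = run_crawl(entry["seed_url"])
        record = extract(xml, entry["meta_info"])
        records.append(push_to_kafka(push_to_workspace(record)))
    return records
```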

3. App Crawl Run Instructions

The following sections describe the exact steps to run the instrumentation crawl for each app. Each app requires two terminals: one for Appium and one for running the Python scripts.

Before starting: ensure Appium is installed and the target device/emulator is connected and ready. The Python environment must be set up under App_Instrumentation/App_env/.
🇦🇺 Woolworths_app-AU

Australia · Grocery & Supermarket App

⏰ Start time: 4:30 AM
Step 1 · Terminal 1: Start the Appium server

Open a terminal and launch the Appium server. Keep this terminal running throughout the entire session.

appium
Step 2 · Terminal 2: Activate the Python environment

Open a second terminal and navigate to the project directory, then activate the virtual environment.

cd App_Instrumentation
source App_env/bin/activate
Step 3: Run the search crawl (~1 hr 30 min)

In Terminal 2, run the search script. This crawls search result pages and collects XML dumps and screenshots.

python woolworth_app_search.py

Wait for this to complete fully before proceeding to the next step.

Step 4: Run the listing crawl (~30 min)

Once the search crawl has completed, run the listing script to collect product detail page data.

python woolworth_app_listing.py
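Because the listing crawl must not start until the search crawl has fully finished, the two commands can also be chained by a small runner that stops on the first failure. This is an optional convenience sketch, not part of the documented procedure; the script names are the ones listed above.

```python
import subprocess
import sys

def run_sequential(scripts):
    """Run each crawl script to completion; abort the chain on the first failure.

    `scripts` is a list of argument lists passed to the current Python interpreter.
    """
    for script in scripts:
        result = subprocess.run([sys.executable, *script])
        if result.returncode != 0:
            raise SystemExit(f"{script[0]} failed with exit code {result.returncode}")

# Woolworths: the listing crawl starts only after the search crawl completes.
# run_sequential([["woolworth_app_search.py"], ["woolworth_app_listing.py"]])
```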
🇮🇩 Tokopedia_app-ID

Indonesia · E-commerce App

⏰ Start time: 8:30 AM
Step 1 · Terminal 1: Start the Appium server

Open a terminal and launch the Appium server. Keep this terminal running throughout the entire session.

appium
Step 2 · Terminal 2: Activate the Python environment

Open a second terminal and navigate to the project directory, then activate the virtual environment.

cd App_Instrumentation
source App_env/bin/activate
Step 3: Run the search crawl (~2 hours)

In Terminal 2, run the search script with the Tokopedia seed file as the argument.

python tokopedia_app_search.py tokopedia_search_seed.txt

Wait for this to complete fully before proceeding to the next step.

Step 4: Run the listing crawl (~30 min)

Once the search crawl has completed, run the listing script.

python tokopedia_app_listing.py
🇦🇪 Talabat_app-AE

UAE · Food Delivery App

Step 1 · Terminal 1: Start the Appium server

Open a terminal and launch the Appium server. Keep this terminal running throughout the entire session.

appium
Step 2 · Terminal 2: Activate the Python environment

Open a second terminal and navigate to the project directory, then activate the virtual environment.

cd App_Instrumentation
source App_env/bin/activate
Step 3: Run the search crawl (~1 hr 30 min)

In Terminal 2, run the search script with the Talabat search seed file as the argument.

python talabat_app_search.py talabat_search_seed_file.txt

Wait for this to complete fully before proceeding to the next step.

Step 4: Run the listing crawl (~30 min)

Once the search crawl has completed, run the listing script with the Talabat listing seed file.

python talabat_app_listing.py talabat_listing_seed_file.txt

4. Detailed Process Steps

Step 1: App Instrumentation Crawls

Crawl jobs are run for each of the 3 apps to instrument their pages and collect raw data.

What is collected: XML dumps of each crawled page, plus URL screenshots.

Output location: [path/to/crawl/output/]

The crawl is configured per app using the seed file (see Step 2). The frequency and scope of crawls can be adjusted based on the app's update cycle.

Step 2: Seed File Ingestion

Each app has a corresponding seed file that drives the crawl and extraction process. The seed file contains two key fields:

| Field | Description | Example |
|-------|-------------|---------|
| seed_url | The starting URL(s) for the crawl; the crawler begins traversal from these URLs. | https://app.example.com/category/shoes |
| meta_info | Metadata associated with the seed URL, such as category name, keyword, app identifier, and any configuration flags. | {"category": "shoes", "keyword": "running shoes"} |

The seed file is loaded at the start of both the crawl and the extraction phases. It acts as the source of truth for what data to collect and how to label it.

Seed file location: [path/to/seed_file]
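The exact on-disk format of the seed file is not documented on this page. Assuming a simple line-oriented layout (seed_url, a tab, then meta_info as JSON), ingestion might look like this:

```python
import json

def parse_seed_file(lines):
    """Parse seed entries of the assumed form: seed_url<TAB>meta_info-as-JSON.

    Blank lines and '#' comments are skipped. The real format may differ.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        seed_url, meta_raw = line.split("\t", 1)
        entries.append({"seed_url": seed_url, "meta_info": json.loads(meta_raw)})
    return entries

# Toy example mirroring the field table above.
sample = [
    'https://app.example.com/category/shoes\t'
    '{"category": "shoes", "keyword": "running shoes"}',
]
entries = parse_seed_file(sample)
```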

Step 3: Local Data Extraction via CMv3

This step transforms the raw XML dumps into structured product data using the CMv3 extraction framework. The process runs locally and involves the following sub-steps:

  1. Fetch regex configuration from the CMv3 API: extractor.py calls the CMv3 service to retrieve the regex patterns associated with the app. These patterns define how product fields (title, price, URL, etc.) are identified within the XML.
  2. Apply regex to the XML dump: the patterns are applied to the collected XML dump for each seed_url, parsing the relevant product attributes out of the raw markup.
  3. Output a JSON file: the extracted data is written to a JSON file containing all products grouped by keyword or category, as defined in the seed file's meta_info.
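The fetch-and-apply flow can be sketched as follows. The CMv3 endpoint URL and the shape of its response (one named regex per product field, plus a "product" pattern that splits the dump into per-product blocks) are assumptions for illustration; only the regex mechanics are standard Python.

```python
import json
import re
import urllib.request

# Hypothetical CMv3 endpoint; the real URL and payload shape are not documented here.
CMV3_CONFIG_URL = "https://cmv3.internal/api/regex-config?app={app}"

def fetch_regex_config(app):
    """Fetch per-app regex patterns from CMv3 (assumed shape: {field: pattern})."""
    with urllib.request.urlopen(CMV3_CONFIG_URL.format(app=app)) as resp:
        return json.loads(resp.read())

def extract_products(xml_text, patterns):
    """Apply one regex per product field; a 'product' pattern splits the dump first."""
    fields = {name: re.compile(p) for name, p in patterns.items()}
    blocks = fields.pop("product").findall(xml_text)
    products = []
    for rank, block in enumerate(blocks, start=1):
        item = {"rank": rank}
        for name, rx in fields.items():
            m = rx.search(block)
            if m:
                item[name] = m.group(1)
        products.append(item)
    return products

# Toy dump and patterns, standing in for a real XML dump and CMv3 config.
sample_xml = ('<node text="Nike Air Zoom" price="5499"></node>'
              '<node text="Asics Gel" price="4999"></node>')
sample_patterns = {
    "product": r"<node .*?</node>",
    "title": r'text="([^"]+)"',
    "price": r'price="([^"]+)"',
}
products = extract_products(sample_xml, sample_patterns)
```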

Key Component: extractor.py

| Attribute | Detail |
|-----------|--------|
| File path | crawler/extractor.py |
| Role | Orchestrates the extraction: calls CMv3 for config, applies regex, writes JSON |
| Input | XML dump file path, seed file path |
| Output | JSON file with product data per keyword/category |
| API dependency | CMv3 API (regex config endpoint) |

Sample JSON Output Structure

{
  "keyword": "running shoes",
  "category": "shoes",
  "app": "App 1",
  "products": [
    {
      "title": "Nike Air Zoom",
      "price": "₹5,499",
      "url": "https://app.example.com/product/nike-air-zoom",
      "rank": 1
    },
    ...
  ]
}
Step 4: JSON Output → Workspace Push

Once extraction is complete, the output JSON file is pushed to the workspace. The workspace acts as a staging area where files are organized by date before being consumed by the Kafka pipeline.

Directory structure in workspace:

workspace/
└── data/
    └── YYYY-MM-DD/          ← today's data folder
        ├── app1_keyword1.json
        ├── app1_keyword2.json
        ├── app2_keyword1.json
        └── ...

Files are named to reflect the app and keyword/category for easy identification and traceability.
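A minimal sketch of the push, assuming the naming convention above (app plus keyword, with spaces replaced by underscores); the real push script may differ:

```python
import datetime
import json
import pathlib

def workspace_path(root, app, keyword, when=None):
    """Build the dated staging path, e.g. workspace/data/2024-01-31/app1_running_shoes.json."""
    day = (when or datetime.date.today()).isoformat()
    fname = f"{app}_{keyword.replace(' ', '_')}.json"
    return pathlib.Path(root) / "data" / day / fname

def push_to_workspace(record, root="workspace"):
    """Write one extraction record into today's data folder, creating it if needed."""
    dest = workspace_path(root, record["app"], record["keyword"])
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(json.dumps(record, ensure_ascii=False, indent=2))
    return dest

# Demo under a throwaway directory so nothing real is touched.
import tempfile
demo_root = tempfile.mkdtemp()
dest = push_to_workspace({"app": "app1", "keyword": "running shoes", "products": []}, demo_root)
```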

Step 5: Crontab → Kafka Pipeline

A crontab job is configured in the workspace environment to periodically run commands that push data to Kafka. The job is triggered whenever a new JSON file lands in today's data folder.

How it works: on each scheduled run, the job scans today's dated folder, identifies JSON files that have not yet been pushed, and publishes each new file to the corresponding Kafka topic.

Crontab Configuration (example)

# Run every 15 minutes: push new JSON files to Kafka
*/15 * * * * /path/to/scripts/kafka_push.sh /workspace/data/$(date +\%Y-\%m-\%d)/
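The push script itself is not included on this page. A Python dry-run sketch of the behaviour the cron entry implies (scan today's folder, push only files not seen before) could look like this; the actual Kafka publish is left as an injected `send` callable, since broker and topic details are still open items below.

```python
import pathlib

def find_unpushed(day_dir, pushed):
    """Return JSON files in the dated folder that have not been pushed yet."""
    day = pathlib.Path(day_dir)
    if not day.is_dir():
        return []
    return sorted(f for f in day.glob("*.json") if f.name not in pushed)

def push_new_files(day_dir, pushed, send):
    """Publish each new file via `send(path)` and record it as pushed."""
    done = []
    for f in find_unpushed(day_dir, pushed):
        send(f)
        pushed.add(f.name)
        done.append(f.name)
    return done

# Demo: two staged files, then a second (idempotent) run that finds nothing new.
import tempfile
day = pathlib.Path(tempfile.mkdtemp())
(day / "app1_keyword1.json").write_text("{}")
(day / "app2_keyword1.json").write_text("{}")
pushed, sent = set(), []
first = push_new_files(day, pushed, sent.append)
second = push_new_files(day, pushed, sent.append)
```

Tracking the pushed set in a state file (rather than in memory) would be needed for a real cron job, since each run is a fresh process.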

Kafka Topic Mapping

| App | Kafka Topic | Notes |
|-----|-------------|-------|
| App 1 | [topic_name] | Add details |
| App 2 | [topic_name] | Add details |
| App 3 | [topic_name] | Add details |

5. Data Flow Summary

| Stage | Input | Process | Output |
|-------|-------|---------|--------|
| Crawl | seed_url | App instrumentation crawl job | XML dump, URL screenshots |
| Seed ingestion | Seed file | Parse meta_info + seed_url | Crawl config + extraction context |
| Extraction | XML dump + CMv3 regex config | extractor.py applies regex | JSON file (products per keyword) |
| Workspace push | JSON file | Copy to today's dated folder | File in workspace/data/YYYY-MM-DD/ |
| Kafka push | JSON file in today's folder | Crontab triggers push command | Data published to Kafka topic |

6. Dependencies & Tools

| Component | Description | Owner / Docs |
|-----------|-------------|--------------|
| CMv3 API | Provides regex extraction configuration per app | [Link to CMv3 docs] |
| crawler/extractor.py | Core extraction script: reads XML, applies regex, writes JSON | [Link to repo] |
| Seed file | Defines crawl scope (seed_url + meta_info) per app | [Link to seed file location] |
| Workspace | Shared storage where JSON files are staged per date | [Link to workspace] |
| Kafka | Message queue receiving the extracted product data | [Link to Kafka setup] |
| Crontab | Scheduler for automated Kafka push commands | [Server / environment] |

7. Open Items & TODOs

| # | Item | Owner | Status |
|---|------|-------|--------|
| 1 | Fill in app names and descriptions in the Scope table | | Pending |
| 2 | Confirm crontab schedule and add exact cron expression | | Pending |
| 3 | Add Kafka topic names per app | | Pending |
| 4 | Attach architecture diagram to this page | | Pending |
| 5 | Document error handling and retry logic in extractor.py | | Pending |

Page maintained by the Data Engineering team. For questions, reach out via [Slack channel].