Car buying is fun with Snowflake Part 1

Parag Shah
Published in Analytics Vidhya
5 min read · Nov 3, 2020

Like everyone else, I found the car-buying process complex and time-consuming, mainly because one has to visit many dealers' websites and browse their inventory to find the perfect car with all the desired options.

After looking at a few of the dealers' websites, I realized that all of them looked alike. I wondered whether writing a web scraper in Python could help me query all the dealers' inventory in a single go.

Based on how the filters behaved, it seemed the websites were using REST APIs to load new data, so I opened the browser's developer tools to look for a REST endpoint the scraper could leverage.

Screenshot: the endpoint visible under Form Data in the browser's developer tools

Looking at the endpoint, it became clear that the dealer websites are built on WordPress and use a VMS plugin (I assume it is an abbreviation of Vehicle Management System) that holds all the data in a central database, keyed by a corresponding "cp" id. By changing the value of cp in the endpoint URL in the browser's address bar, I could fetch the JSON inventory data for that particular dealer.

https://vms.prod.convertus.rocks/api/filtering/?cp=1000&ln=en&pg=1&pc=15

Let's view this JSON data in the Notepad++ JSON Viewer plugin to get an idea of the structure. Looking at the result count, it was easy to correlate it with the "pc" value in the query string.

Each record here represents a particular vehicle, with attributes such as VIN, make, model, price, etc.
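To give an idea of the shape, here is a simplified sketch of a single record (the field names are taken from the attributes parsed later in this post; the values are made-up placeholders, and most fields are omitted):

{
  "results": [
    {
      "ad_id": "12345",
      "stock_number": "A1234",
      "vin": "1HGCV1F34LA000000",
      "year": "2020",
      "make": "Honda",
      "model": "Accord",
      "retail_price": 31999.0,
      "company_data": {
        "company_name": "Example Honda",
        "company_city": "Vancouver"
      }
    }
  ]
}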

While it is not a good idea for APIs to use sequential numeric ids that are easy to guess and iterate over, in this case they make it easy to pull data from many dealers with a loop and a few lines of Python, dumping each API response into a JSON file.

import requests

# Iterate over a range of dealer ids ("cp") and save each response to its own file
for x in range(1000, 2700):
    url = "https://vms.prod.convertus.rocks/api/filtering/?cp={0}&ln=en&pg=1&pc=5000".format(x)
    print(url)
    try:
        resp = requests.get(url)
        # Write the raw JSON payload for this dealer to disk
        with open("vms_{0}.json".format(x), "w") as f:
            f.write(resp.text)
    except requests.RequestException as e:
        print("Request for cp={0} failed: {1}".format(x, e))

This creates one JSON file for each request in the loop.

In Snowflake, let's create a database, a table, a file format, and an internal named stage to ingest the JSON files.

//Create a database and a table to hold this raw data
CREATE DATABASE VMS COMMENT = 'Vehicle Management System';
CREATE TABLE "VMS"."PUBLIC"."VMS_STAGE_1" ("JSON_DATA" VARIANT) COMMENT = 'Stage level 1 for VMS';

//Create a file format to ingest the VMS JSON files
CREATE FILE FORMAT "VMS"."PUBLIC".vms_json_format
    TYPE = 'JSON' COMPRESSION = 'AUTO' ENABLE_OCTAL = FALSE ALLOW_DUPLICATE = FALSE
    STRIP_OUTER_ARRAY = FALSE STRIP_NULL_VALUES = FALSE IGNORE_UTF8_ERRORS = FALSE
    COMMENT = 'Format to ingest VMS JSON files';

//Create an internal named stage to upload data into Snowflake
CREATE STAGE "VMS"."PUBLIC".vms_stage COMMENT = 'Stage to upload VMS JSON files for ingestion';

The next step is to upload the JSON files using the SnowSQL tool. Open CMD / Terminal and navigate to the directory where the JSON files were downloaded.
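Starting SnowSQL looks roughly like this (the account identifier and username are placeholders for your own connection details):

snowsql -a <account_identifier> -u <username>

Once logged in, trigger the following command: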

//Upload all *.json files to stage
put file://*.json @VMS_STAGE;

It takes a while to execute, as there are many files to compress and encrypt before they are uploaded to the Snowflake stage.
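As a quick sanity check, LIST shows the files now sitting in the stage (PUT compresses them with gzip by default, so they appear with a .gz suffix):

//Verify the uploaded files in the stage
list @vms_stage;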

Time to copy the data from the uploaded files into the level 1 stage table:

copy into vms.public.vms_stage_1 from @vms_stage file_format=(format_name=VMS_JSON_FORMAT);

It took about 15 seconds to insert all the data, which I think is fast considering it ran on an XS-size warehouse.

Again, time to validate how the data looks in the stage table.
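Any small query will do; for example:

//Peek at a few raw rows: each row holds one dealer's entire inventory payload
select json_data from vms.public.vms_stage_1 limit 5;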

Remember that each row contains the data for all the vehicles at a specific dealer, so listing individual vehicles requires flattening the results node.

select value:vin::text as vin,
       value:make::text as make,
       value:model::text as model,
       value:retail_price::double as retail_price
from vms.public.vms_stage_1,
     lateral flatten(input => json_data:results);

Next, create a view that parses the "interesting" attributes of each vehicle into separate columns:

create or replace view vms_stage_2 as
select value:ad_id::text as id, value:stock_number::text as stock_number, value:vin::text as vin,
       value:days_on_lot::int as days_on_lot, value:sale_class::text as sale_class, value:demo::text as demo,
       value:sale_expiry::text as sale_expiry, value:vehicle_class::text as vehicle_class, value:year::text as year,
       value:make::text as make, value:model::text as model, value:trim::text as trim, value:passenger::text as passenger,
       value:retail_price::double as retail_price, value:lowest_price::double as lowest_price, value:asking_price::double as asking_price,
       value:internet_price::double as internet_price, value:final_price::double as final_price,
       value:wholesale_price::double as wholesale_price, value:sales_tax::double as sales_tax,
       value:odometer::int as odometer, value:fuel_type::text as fuel_type, value:transmission::text as transmission,
       value:engine::text as "ENGINE", value:drive_train::text as drive_train, value:doors::text as doors,
       value:exterior_color::text as exterior_color, value:vdp_url::text as vdp_url, value:image:image_original::text as image_original,
       value:manu_program::text as manu_program, value:manu_exterior_color::text as manu_exterior_color,
       value:body_style::text as body_style, value:certified::int as is_certified,
       value:company_data:company_name::text as company_name, value:company_data:company_city::text as company_city,
       value:company_data:company_province::text as company_province, value:company_data:company_sales_email::text as company_sales_email,
       value:company_data:company_sales_phone::text as company_sales_phone
from vms.public.vms_stage_1,
     lateral flatten(input => json_data:results);

When querying this view, it becomes easy to compare the various price columns and spot the best available price.
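For example, here is an illustrative query against the view (the make and model values are arbitrary) that surfaces the cheapest matches across all the scraped dealers:

//Find the cheapest matching vehicles across all dealers
select year, make, model, odometer, final_price, company_name, company_city
from vms_stage_2
where make = 'Honda' and model = 'Civic'
order by final_price
limit 10;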

I love how easy it is to parse JSON with Snowflake's syntax, and the performance it offers. Next, I will look into automating this workflow using Streams.

This is my first time posting on Medium. I hope you enjoyed this write-up.

Part 2 is published. It is all about automating this workflow.
