Karan Kashyap
11/22/2023, 12:33 PM
Ankur Huralikoppi
11/23/2023, 11:28 AM
$ head -2 bin/supervise
#!/usr/bin/env perl
Tapajit Chandra Paul
11/24/2023, 4:38 AM
Utkarsh Chaturvedi
11/27/2023, 12:16 PM
Chandni
11/29/2023, 3:31 PM
Aaron Coleman
12/08/2023, 9:26 PM
Mike Sherman
01/05/2024, 3:34 PM
Mike Sherman
01/05/2024, 3:36 PM
volumes:
metadata_data: {}
middle_var: {}
historical_var: {}
broker_var: {}
coordinator_var: {}
router_var: {}
druid_shared: {}
datagen_data: {}
and whether I need to create directories ahead of time, what kind of initialization is needed, etc.
Charles Smith
01/05/2024, 7:15 PM
docker compose down -v
option to remove the volumes. https://docs.docker.com/storage/volumes/. @Sergio Ferragut, correct me if I'm wrong.
Mike Sherman
01/06/2024, 4:14 PM
arthursherman@ARTHURs-MBP learn-druid % docker compose --profile druid-jupyter up -d
[+] Running 88/8
⠋ datagen 12 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 16.0s
...
✘ middlemanager Error 33.0s
...
Error response from daemon: Get "https://registry-1.docker.io/v2/imply/druid/manifests/sha256:fb0870d20dd7de803d0ef7d238587e7ced1938f2bcb455e65ac3cf0f3bbdc8a0": proxyconnect tcp: dial tcp 192.168.65.1:3128: i/o timeout
I re-ran the docker-compose and it's now working. I feel like a kid in a candy store with this docker cluster, thanks team!
https://media.giphy.com/media/HkkvIl42l1OzS/giphy.gif
Mike Sherman
01/09/2024, 4:06 PM
The datagen process for clicks seems to be hung, and when you go to the URI http://localhost:9999/file/clicks.json it seems to continue generating the clicks events, even after waiting for more than 2 minutes.
Mike Sherman
01/10/2024, 2:25 PM
In the 11-joins.ipynb Jupyter Notebook, I noticed that the users table is loaded with a static timestamp of 1970-01-01. We also loaded a static timestamp for our so-called lookup datasets at my previous employer. Wondering if you all recommend that as a best practice?
Mike Sherman
01/10/2024, 4:34 PM
REPLACE INTO "clicks_enhanced" OVERWRITE ALL
WITH
"users_ext" AS
(
SELECT *
FROM TABLE(
EXTERN(
'{"type":"http","uris":["http://datagen:9999/file/users.json"]}',
'{"type":"json"}'
)
) EXTEND ("time" VARCHAR, "user_id" VARCHAR, "first_name" VARCHAR, "last_name" VARCHAR, "dob" VARCHAR, "address_lat" VARCHAR, "address_long" VARCHAR, "marital_status" VARCHAR, "income" VARCHAR, "signup_ts" VARCHAR)
),
"clicks_ext" AS
(
SELECT *
FROM TABLE(
EXTERN(
'{"type":"http","uris":["http://datagen:9999/file/clicks.json"]}',
'{"type":"json"}'
)
) EXTEND ("time" VARCHAR, "user_id" VARCHAR, "event_type" VARCHAR, "client_ip" VARCHAR, "client_device" VARCHAR, "client_lang" VARCHAR, "client_country" VARCHAR, "referrer" VARCHAR, "keyword" VARCHAR, "product" VARCHAR)
)
SELECT
TIME_PARSE(c."time") AS "__time",
c."user_id",
c."event_type",
c."client_ip",
c."client_device",
c."client_lang",
c."client_country",
c."referrer",
c."keyword",
c."product",
u."first_name",
u."last_name",
u."dob",
TIMESTAMPDIFF(YEAR, TIME_PARSE(u."dob"), CURRENT_TIMESTAMP) AS "age",
ROUND(TIMESTAMPDIFF(YEAR, TIME_PARSE(u."dob"), CURRENT_TIMESTAMP), -1) AS "age_group",
u."address_lat",
u."address_long",
u."marital_status",
u."income",
u."signup_ts"
FROM "clicks_ext" c LEFT OUTER JOIN "users_ext" u ON c."user_id"=u."user_id"
PARTITIONED BY ALL
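[Editor's note] One detail worth flagging in the query above: ROUND(age, -1) buckets each age to the nearest multiple of ten, so with Druid's documented half-up rounding an age of 35 lands in the 40 bucket while 34 lands in 30. A minimal Python sketch of that bucketing, using hypothetical sample ages (the values are illustrative, not from the notebook):

```python
import math

def age_group(age: int) -> int:
    """Round age to the nearest multiple of 10, halves rounding up,
    mirroring Druid's ROUND(age, -1) half-up behavior."""
    return int(math.floor(age / 10 + 0.5) * 10)

# Illustrative values: 23 -> 20, 25 -> 30, 34 -> 30, 35 -> 40
for age in (23, 25, 34, 35, 49):
    print(age, "->", age_group(age))
```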
Hari Durai Baskar
01/16/2024, 10:45 AM
Kunal Singh Bisht
01/25/2024, 3:53 PM
José Ambrosio Hernández Molina
01/27/2024, 3:47 AM
{
"type": "cachedNamespace",
"extractionNamespace": {
"type": "uri",
"pollPeriod": "PT1H",
"uriPrefix": "hdfs://*.*.*.*/druid/dim_producto/lookups_producto_nombre/",
"fileRegex": ".*.csv",
"namespaceParseSpec": {
"format": "csv",
"skipHeaderRows": 1,
"hasHeaderRow": true,
"columns": [
"PROD_CODIGO",
"PROD_NOMBRE"
]
}
},
"firstCacheTimeout": 120000,
"injective": true
}
In this HDFS location, I have 4 CSV files named part-01.csv, part-02.csv, part-03.csv, and part-04.csv. Each file contains approximately 60k rows, totaling around 240k rows across all files.
The problem I'm facing is that when I create the lookup, it only loads information from one of the files, resulting in a total row count of only 60k. However, I need to load all 240k rows from all the files. Can someone please help me understand what might be causing this issue?
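[Editor's note] Per my reading of the Druid lookup docs, a "uri" extraction namespace loads a single file: when uriPrefix/fileRegex match several files, only the most recently modified match is polled, which would explain seeing roughly 60k of the 240k rows. If that is indeed the cause, one workaround is to merge the part files into a single CSV before the lookup polls it. A sketch, with hypothetical local paths standing in for the HDFS prefix:

```python
import csv
import glob

def merge_csv_parts(pattern: str, out_path: str) -> int:
    """Concatenate CSV part files into one file, keeping a single header row.
    Assumes every part carries the same header. Returns data rows written."""
    rows_written = 0
    header_written = False
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in sorted(glob.glob(pattern)):
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)  # skip each part's header row
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written

# Hypothetical usage (paths are placeholders, not the actual HDFS location):
# merge_csv_parts("lookups_producto_nombre/part-*.csv", "merged/lookup.csv")
```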
Thank you in advance for your assistance. Any insights or suggestions would be greatly appreciated.
José Ambrosio Hernández Molina
01/30/2024, 8:56 PM
INSERT INTO Ventas_plus
SELECT
v."__time",
-- Product Dimensions
LOOKUP(id_producto, 'dim_product_id') as id_product,
LOOKUP(id_producto, 'dim_product_name') as product_name,
LOOKUP(id_producto, 'dim_product_category') as product_category,
-- Point of Sale Dimensions
LOOKUP(id_location, 'dim_location_address') as "location_address",
LOOKUP(id_location, 'dim_location_locality') as "location_locality",
LOOKUP(id_location, 'dim_location_longitude') as "location_longitude",
LOOKUP(id_location, 'dim_location_latitude') as "location_latitude",
-- Sales data
v.quantity,
v.sum_sales,
v.sum_amount,
v.sum_net_amount
FROM "sales" v
PARTITIONED BY DAY
This ingestion is giving me the following error:
Error in the middleManager: "Task execution process exited unsuccessfully with code[3]. See middleManager logs for more details."
The ingestion logs show: Terminating due to java.lang.OutOfMemoryError: GC overhead limit exceeded, and for more information, within the query_detail-xxxx-archive.json of the query, I have these memory metrics:
"memory": { "maxMemory": 1073741824, "totalMemory": 1073741824, "freeMemory": 460467936, "usedMemory": 613273888, "directMemory": 134217728 }
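[Editor's note] The metrics above look like java.lang.Runtime heap figures in bytes (an assumption; usedMemory does equal totalMemory minus freeMemory, which is consistent with that reading). Converted, the task JVM is capped at 1 GiB of heap with 128 MiB of direct memory. A throwaway sketch of the conversion:

```python
# Memory metrics copied from the query_detail report above, in bytes.
# Field meanings are assumed to follow java.lang.Runtime conventions.
metrics = {
    "maxMemory": 1073741824,    # heap ceiling (-Xmx): 1 GiB
    "totalMemory": 1073741824,  # heap currently reserved by the JVM
    "freeMemory": 460467936,    # unused portion of the reserved heap
    "usedMemory": 613273888,    # totalMemory - freeMemory
    "directMemory": 134217728,  # off-heap direct buffers: 128 MiB
}

MIB = 1024 ** 2
for name, value in metrics.items():
    print(f"{name}: {value / MIB:,.0f} MiB")
```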
In query format, the SELECT with lookups and Druid's automatic pagination works correctly; the issue arises only during ingestion. My questions are: how to solve this error; how to identify why the Java garbage collector (GC) is taking excessive time while recovering very little memory in each cycle; and, if increasing the memory assigned to the JVM is one of the solutions, what calculation should be performed to size it so this error is prevented.
Ashutosh Shaha
01/31/2024, 11:11 AM
Aru Raghuwanshi
02/02/2024, 3:41 AM
Divit Bui
02/02/2024, 6:05 PM
Amos Bird
02/06/2024, 7:07 AM
José Ambrosio Hernández Molina
02/06/2024, 8:55 PM
Noor
02/07/2024, 10:25 AM
Алексей Ясинский
02/25/2024, 10:50 AM
钱智熠
02/27/2024, 6:02 AM
Kiran Kumar Puttakota
02/29/2024, 8:13 AM
Noor
03/05/2024, 11:35 AM
Rachid Cherqaoui
03/08/2024, 2:34 PM
Lev Zelkind
03/19/2024, 3:41 PM
Noor
03/20/2024, 5:04 PM