# troubleshoot
important-afternoon-19755
Hi, team. My DataHub version is v0.10.2. I'm trying to populate the Queries tab using `DatasetUsageStatisticsClass`. I can see the Queries tab, and it works well in a test (I emitted to about 30 URNs). But after I emitted `DatasetUsageStatisticsClass` to about 4k URNs, clicking a data source loads for about 10 seconds and then shows the error "An unknown error occurred. (code 500)", and the page looks like the picture I attached. This happens even for data sources whose Queries tab I haven't populated. Is there a limit to how many datasets can have a populated Queries tab? Also, I set the max length of each query I emit to the Queries tab to 10000; is there a limit on the length of each query?
DataHub Community Support bot
Hey there 👋 I'm The DataHub Community Support bot. I'm here to help make sure the community can best support you with your request. Let's double-check a few things first:
1️⃣ There's a lot of good information on our docs site: www.datahubproject.io/docs. Have you searched there for a solution?
2️⃣ It's not uncommon that someone has run into your exact problem before in the community. Have you searched Slack for similar issues?
delightful-ram-75848
Hi Jiyun - how are you deploying DataHub? I'm wondering whether the resources of each pod are being exceeded during the ingestion. Also, could you post the log of datahub-gms?
important-afternoon-19755
I'm deploying it using docker-compose. After ingestion, running `datahub docker check` reports `No issues detected`. Below is the log of datahub-gms.
delightful-ram-75848
Seems like this is the issue... Can you share your recipe file?
Copy code
.","caused_by":{"type":"illegal_state_exception","reason":"unexpected docvalues type NONE for field 'topSqlQueries' (expected one of [SORTED, SORTED_SET]). Re-index with correct docvalues type."}}}}]},"status":500}
important-afternoon-19755
I emitted the data using this Python code.
Copy code
# Imports added for completeness; the original message omitted them.
from datahub.emitter.mce_builder import get_sys_time
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import (
    CalendarIntervalClass,
    ChangeTypeClass,
    DatasetUsageStatisticsClass,
    TimeWindowSizeClass,
)

# trim_query truncates a query to a character budget (assumed to be the helper
# from datahub.utilities.sql_formatter; the original snippet does not show its source).
from datahub.utilities.sql_formatter import trim_query


def _emit_to_datahub(queries, query_count: int, db_name: str, tb_name: str):
    if queries and db_name and tb_name:
        # Trim each query to at most 10000 characters before emitting.
        top_sql_queries = [
            trim_query(
                query,
                budget_per_query=10000,
            )
            for query in queries
        ]

        # Daily usage statistics carrying the top SQL queries for this dataset.
        usageStats = DatasetUsageStatisticsClass(
            timestampMillis=get_sys_time(),
            eventGranularity=TimeWindowSizeClass(unit=CalendarIntervalClass.DAY, multiple=1),
            totalSqlQueries=query_count,
            topSqlQueries=top_sql_queries,
        )

        mcp = MetadataChangeProposalWrapper(
            entityType="dataset",
            aspectName="datasetUsageStatistics",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=f'urn:li:dataset:(urn:li:dataPlatform:glue,{db_name}.{tb_name},PROD)',
            aspect=usageStats,
        )

        # Emit metadata; `emitter` is a DataHub emitter constructed elsewhere
        # (e.g. a DatahubRestEmitter, see the sketch below).
        emitter.emit(item=mcp)
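For reference, a minimal sketch of how the emitter and a call to the function above might look; the GMS endpoint, the sample queries, and the database/table names are placeholder assumptions, not values from this thread.
Copy code
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Assumed GMS endpoint for a local docker-compose deployment.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Hypothetical queries and Glue database/table names, purely for illustration.
sample_queries = [
    "SELECT id, name FROM customers WHERE signup_date >= '2023-01-01'",
    "SELECT COUNT(*) FROM orders",
]

_emit_to_datahub(
    queries=sample_queries,
    query_count=len(sample_queries),
    db_name="sample_db",
    tb_name="sample_table",
)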
delightful-ram-75848
Thank you for sharing - this seems to be an issue on our side - we'll get back to you! @gentle-hamburger-31302
important-afternoon-19755
Hey, I found the cause. Some of my top_sql_queries entries are longer than 32766 bytes, so the Elasticsearch bulk request failed. After I kept the length of top_sql_queries under 32766, everything works well.
delightful-ram-75848
Glad you figured it out!
bland-lighter-26751
Hey @important-afternoon-19755. What do you mean by "you fixed top_sql_queries's length"? Is that a DataHub setting, or something else?
worried-laptop-98985
I set up DataHub locally yesterday for the first time. Ingesting one BigQuery project went without issue. On ingesting a second project, I now see the message below in the logs, and any attempt to view a dataset (any dataset) fails. The front end gives the "Something went wrong, Error 500" message. Is this a known issue? It's a very early roadblock for me.
The exact message is:
{"error":{"root_cause":[{"type":"exception","reason":"java.util.concurrent.ExecutionException: java.lang.IllegalStateException: unexpected docvalues type NONE for field 'topSqlQueries' (expected one of [SORTED, SORTED_SET])
important-afternoon-19755
@bland-lighter-26751 Oh, sorry. My Slack notifications were off, so I'm only seeing your message now 😂
When I clicked on a data source, I saw that the gms container logged an error related to Elasticsearch, as shown in the earlier comments of this thread. I assumed there was a problem when ingesting topSqlQueries and that Elasticsearch was not able to index it. When I checked the logs of the gms container again while ingesting topSqlQueries, I found the following error and realized it was caused by a limit on the length of topSqlQueries.
Copy code
2023-05-12 14:25:12,148 [I/O dispatcher 1] ERROR c.l.m.s.e.update.BulkListener - Failed to feed bulk request. Number of events: 21 Took time ms: -1 Message: failure in bulk execution:
[7]: index [dataset_datasetusagestatisticsaspect_v1], type [_doc], id [25e23835a2de64beff172907fc73c967], message [ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Document contains at least one immense term in field="topSqlQueries" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[91, 34, 83, 69, 76, 69, 67, 84, 92, 110, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 116, 97]...', original message: bytes can be at most 32766 in length; got 33502]]; nested: ElasticsearchException[Elasticsearch exception [type=max_bytes_length_exceeded_exception, reason=bytes can be at most 32766 in length; got 33502]];]
[8]: index [dataset_datasetusagestatisticsaspect_v1], type [_doc], id [5b70b999469bf8232731c1fb3a7e8a9d], message [ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Document contains at least one immense term in field="topSqlQueries" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[91, 34, 83, 69, 76, 69, 67, 84, 92, 110, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 116, 97]...', original message: bytes can be at most 32766 in length; got 51237]]; nested: ElasticsearchException[Elasticsearch exception [type=max_bytes_length_exceeded_exception, reason=bytes can be at most 32766 in length; got 51237]];]
[9]: index [dataset_datasetusagestatisticsaspect_v1], type [_doc], id [5ab06f02d23bfc976a63bdccd605d465], message [ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Document contains at least one immense term in field="topSqlQueries" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[91, 34, 83, 69, 76, 69, 67, 84, 92, 110, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 116, 97]...', original message: bytes can be at most 32766 in length; got 56755]]; nested: ElasticsearchException[Elasticsearch exception [type=max_bytes_length_exceeded_exception, reason=bytes can be at most 32766 in length; got 56755]];]
I'm using DatasetUsageStatisticsClass. It can be used like this:
Copy code
DatasetUsageStatisticsClass(
    timestampMillis=get_sys_time(),
    eventGranularity=TimeWindowSizeClass(unit=CalendarIntervalClass.DAY, multiple=1),
    totalSqlQueries=query_count,
    topSqlQueries=top_sql_queries,
)
So I kept the total UTF-8 length of top_sql_queries under 32766 bytes, like this:
Copy code
# top_sql_queries: List[str]
sum(len(top_sql_query.encode("utf-8")) for top_sql_query in top_sql_queries) < 32766
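Below is a minimal sketch of one way to enforce that bound before building the aspect; the 32766 figure comes from the Elasticsearch error above, and the helper name is hypothetical.
Copy code
from typing import List

# Max term length (in bytes) reported by Elasticsearch in the gms log above.
ES_MAX_TERM_BYTES = 32766


def cap_top_sql_queries(queries: List[str], max_bytes: int = ES_MAX_TERM_BYTES) -> List[str]:
    """Keep queries, in order, while their combined UTF-8 size stays under max_bytes."""
    capped: List[str] = []
    used = 0
    for query in queries:
        size = len(query.encode("utf-8"))
        if used + size >= max_bytes:
            break  # drop the remaining queries, as in the workaround above
        capped.append(query)
        used += size
    return capped


# e.g. top_sql_queries = cap_top_sql_queries(top_sql_queries) before passing
# them to DatasetUsageStatisticsClass(topSqlQueries=...)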
@worried-laptop-98985 I'm not sure if this is a common problem with DataHub, but I encountered a similar error message, just like you did. To troubleshoot, it might be helpful to examine the logs of the GMS container during the ingestion process. If you come across error messages related to Elasticsearch, it could be worth checking whether keeping the length of `topSqlQueries` under 32766 bytes resolves the issue.
worried-laptop-98985
This does feel like a bug that should be handled by DataHub. I may have misread your workaround, but it looks like you're just dropping any longer SQL queries. Is that right?
Connor had the same issue, and it appears it was fixed by deleting and re-creating the ingestion connection, which makes it seem like a problem of state.
important-afternoon-19755
That does sound a bit unusual. In my case, the source I was working with didn't populate the Queries tab during ingestion, so I manually ingested SQL queries using a Python emitter. I had assumed that DataHub's ingestion process would handle the length limit of SQL queries. However, if you encountered the bug while using DataHub's own ingestion, it could indeed be a bug that needs to be addressed by DataHub.
worried-laptop-98985
Hi @delightful-ram-75848 - you mentioned above that this might be a code issue. Did anything get raised for this? Thx
Just to add a bit more info to this: deleting and re-adding the ingestion connection did not fix it for me the way it did for Connor.
bland-lighter-26751
Issue is back for me today 😞
and re-creating the connection fixed it again
worried-laptop-98985
I tried that but no joy