#ask-community-for-troubleshooting

Nick Joanis

08/11/2021, 3:44 AM
Hi everyone, a quick introduction: I am a developer currently evaluating Airbyte as an alternative to Fivetran in a personal project of migrating an in-house custom ETL pipeline to an updated / more managed approach. As a test case, I have developed a Lightspeed Retail HTTP connector to extract data from different Lightspeed shops.
I am fairly new to data engineering and looking to get insights on multi-tenant ingestion, partitioning and best practices in an Airbyte context. I am aware of the different tenancy models, just unsure of which approach is best in the current context.
Q: How should I organise the data so that records can be traced back to their tenants? In Lightspeed's responses / records, there is no unique identifier that would allow me to identify the tenant from the raw data in later stages and eventually in normalization. My connector uses an AccountID parameter that would be a good fit to partition data. However, I currently don't see a way to pass such data to the records / streams when syncing.
I am aware that my approach might be wrong. I would love to understand the correct approach before jumping into transformations with dbt. Love the project and looking forward to having the opportunity to contribute!

George Claireaux (Airbyte)

08/11/2021, 9:29 AM
Hey @Nick Joanis 👋 If I understand correctly, you have an AccountID parameter that you pass as part of the HTTP request, and then receive data back for that specific AccountID (a.k.a. the tenant in this case?). However, the data itself contains no reference to that AccountID/tenant.
I've encountered similar scenarios in the past.
• One approach would be to extract and load the data separately (per AccountID / tenant) so you have a table per tenant, then apply transforms on top of these, such as adding a tenant identifier column and unioning to create a full table/view.
• Another approach would be to build multi-tenant capability into the connector. The user could provide one or more AccountIDs and then we could add this extra identifier field by mutating the records as they stream. This would result in the same final output as the first approach, with no intermediate tables.
Let me know if that helps / I've misunderstood your scenario 😄
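The second approach can be sketched in plain Python, independent of Airbyte. This is only an illustration under assumptions: `tag_records`, `read_all_accounts`, `fake_fetch` and the record fields (`saleID`, `total`) are hypothetical names, and `fake_fetch` stands in for the real per-account Lightspeed request.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def tag_records(records: Iterable[Record], account_id: str) -> Iterator[Record]:
    """Yield a copy of each record with the tenant identifier added."""
    for record in records:
        yield {**record, "account_id": account_id}


def read_all_accounts(
    account_ids: List[str],
    fetch: Callable[[str], Iterable[Record]],
) -> Iterator[Record]:
    """Fetch each account in turn and tag its records as they stream."""
    for account_id in account_ids:
        yield from tag_records(fetch(account_id), account_id)


def fake_fetch(account_id: str) -> List[Record]:
    """Hypothetical stand-in for the per-account API call."""
    return [{"saleID": 1, "total": "10.00"}, {"saleID": 2, "total": "5.50"}]


rows = list(read_all_accounts(["111", "222"], fake_fetch))
```

Every row in `rows` carries an `account_id`, so records from all tenants can land in one table without a later union step.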

Nick Joanis

08/11/2021, 12:47 PM
Hi @George Claireaux (Airbyte), you understood my use case very well, and this is exactly the type of answer I was looking for. 👏 In my previous project, I would have opted for the second option, where records are mutated as they stream. However, in Airbyte, I felt like it was bending the rules too much. Thanks for the info, this is awesome!

George Claireaux (Airbyte)

08/11/2021, 12:48 PM
Happy to help 😄

Luke Bussey

08/11/2021, 1:13 PM
I found out that it’s easy to append data to the record with the parse_response method when using the CDK. E.g.
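import requests
from typing import Any, Iterable, Mapping
def parse_response(self, response: requests.Response, stream_state: Mapping[str, Any], stream_slice: Mapping[str, Any] = None, next_page_token: Mapping[str, Any] = None) -> Iterable[Mapping]:
    # Assumes the body is a single JSON object and the stream slice carries "account_id".
    record = response.json()
    record["account_id"] = stream_slice["account_id"]
    yield record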
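For context, here is a rough sketch of where `stream_slice["account_id"]` could come from. As far as I understand the CDK, it calls `stream_slices` and passes each slice to `parse_response`. To stay runnable without the CDK installed, this sketch only mirrors those method shapes: `SalesStream`, `FakeResponse` and the `saleID` field are hypothetical, and a real connector would subclass the CDK's `HttpStream` instead.

```python
from typing import Any, Iterable, List, Mapping


class FakeResponse:
    """Hypothetical stand-in for requests.Response, exposing only json()."""

    def __init__(self, payload: Any) -> None:
        self._payload = payload

    def json(self) -> Any:
        return self._payload


class SalesStream:
    """Mirrors the method shapes of a CDK stream without importing the CDK."""

    def __init__(self, account_ids: List[str]) -> None:
        self.account_ids = account_ids

    def stream_slices(self, **kwargs: Any) -> Iterable[Mapping[str, Any]]:
        # One slice per tenant; each slice is later handed to parse_response.
        for account_id in self.account_ids:
            yield {"account_id": account_id}

    def parse_response(
        self, response: Any, stream_slice: Mapping[str, Any], **kwargs: Any
    ) -> Iterable[Mapping[str, Any]]:
        payload = response.json()
        # Accept either a single JSON object or a list of objects.
        records = payload if isinstance(payload, list) else [payload]
        for record in records:
            yield {**record, "account_id": stream_slice["account_id"]}


stream = SalesStream(["111", "222"])
out = []
for stream_slice in stream.stream_slices():
    response = FakeResponse([{"saleID": 1}])  # pretend API response
    out.extend(stream.parse_response(response, stream_slice=stream_slice))
```

Copying each record rather than mutating it in place is a design choice of this sketch, not something the CDK requires.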

Nick Joanis

08/11/2021, 4:08 PM
Thanks for the answer @Luke Bussey. That's very similar to what I ended up doing. 👍