Meltano #singer-tap-development

Join Slack

Peter Clemenko

01/19/2025, 2:23 PM

if i wanted to use langchain to automatically generate taps and targets

Peter Clemenko

01/19/2025, 2:23 PM

using llama

Ian OLeary

02/04/2025, 6:52 PM

Copy code

2025-02-04T18:44:21.577497Z [info     ] 2025-02-04 13:44:21,576 | INFO     | tap-litmos.lt_UserDetails | Pagination stopped after 0 pages because no records were found in the last response

Where do I need to alter this to continue paginating even if no records were returned for a particular week? I looked into the BasiAPIPaginator and I didn't find where this message is printed. Here's my current paginator:

Copy code

class LitmosPaginator(BaseAPIPaginator):

    def __init__(self, *args, **kwargs):
        super().__init__(None, *args, **kwargs)

    def has_more(self, response):
        return self.get_next(response) < date.today()

    def get_next(self, response):
        params = dict(parse_qsl(urlparse(response.request.url).query))
        return datetime.strptime(params["to"], OUTPUT_DATE_FORMAT).date() + timedelta(seconds=1)

I'm paginating via a 1 week date range and even if there were no records I still want to move on to the next range. edit: would it be in "advance"? def advance(self, response: requests.Response) -> None: ... ... # Stop if new value None, empty string, 0, etc. if not new_value: self._finished = True else: self._value = new_value

Andy Carter

02/24/2025, 2:52 PM

I have a tap that is based on an Azure Datafactory (ADF) pipeline run - it's a long story.... The tap class itself triggers a pipeline run, which extracts csv data, and saves it into different named files (aligning to tables of a database). Each sdk stream (40+ streams) is checking for the presence of the new file corresponding to its table in storage (using backoff). Once the file arrives it is read and then emitted via the stream in

get_records

in the normal way. Instead of checking and rechecking for each new file in storage, I've discovered I can check for pipeline logs to see as each table / stream file is complete, then just read the file when I know it's saved to storage. However, I don't want to replace my 'file check' code with 'pipeline log check' code in each stream, as the rest call takes a while. Is there a process I can run asynchronously at the
tap
level every 10 seconds or so, and in my
stream.get_records()
check the tap's cached version of the logs from ADF, and emit records if appropriate? Ideally I don't want to wait for the whole pipeline to finish before I start emitting records - some data is ready in seconds but others take minutes.

👀 1

Tanner Wilcox

02/27/2025, 11:38 PM

The API i'm pulling from for my tap doesn't support pagination. I'm having a hard time finding documentation on how to disable that. I can see when I run my tap that it's making a handful of calls to my API. The docs for

request_records())

RESTStream

say: "If pagination is detected, pages will be recursed automatically." but I'm not seeing how it's detecting pagination in this case

✅ 1

Jun Pei Liang

03/07/2025, 12:14 AM

does anyone know where bookmark is saved? i have oracle tap in the following, it keep exporting the entire table .

Copy code

- name: tap-oracle
    variant: s7clarke10
    pip_url: git+<https://github.com/s7clarke10/pipelinewise-tap-oracle.git>
    config:
      default_replication_method: LOG_BASED
      filter_schemas: IFSAPP
      filter_tables:
      - IFSAPP-ABC_CLASS_TAB
      host: xxxx
      port: 1521
      service_name: xxxx
      user: ifsapp
    metadata:
      IFSAPP-ABC_CLASS_TAB:
        replication-method: INCREMENTAL
        replication-key: ROWVERSION

hawkar_mahmod

03/23/2025, 3:06 PM

I've started developing a tap called

tap-growthbook

using the Meltano SDK. When testing without meltano it runs fine, but then when I invoke with

meltano run

via

uv run

I get a discovery related error, and I don't know how to debug it. Here's the error:

✅ 1

hawkar_mahmod

03/24/2025, 3:48 PM

Can someone give me a high-level overview of what happens when parent-child streams run? I'm getting strange behaviour that I don't understand. When I limit the number of records the parents return (by overwriting the

get_records

method) the child class produces the expected number of records but the parent class just stops producing any data in the destination. I am only overwriting on the parent

get_records

not the child stream.

Gordon Klundt

03/24/2025, 7:57 PM

I'm looking for guidance on this particular use case. I'm hoping there is a reference architecture in the world using this pattern. I want to write a stream that reads from an asynchronous endpoint. • I need to POST a payload to get an export ID (endpoint 1) • Use the export ID to poll a status (endpoint 2 - polled until FINISHED) • Get the "chunks" listed in an array when the status reads "FINISHED" (endpoint 3)

Stéphane Burwash

03/25/2025, 5:51 PM

Hey everyone 👋 Happy tuesday! I was wondering if anyone knew how I could only select specific columns to sync using tap-postgres in a json payload. We have a

payload

column which contains WAY too much PII, and I was hoping to be able to sync only the datapoints I needed (only grab values from

payload.xyz

payload.xyz

exists) Thanks! cc @visch since you're my tap-postgres guru

👋 1

hawkar_mahmod

03/28/2025, 5:25 PM

Hey everyone! I'm finding that my pipelines that involve parent-child streams that load to DB are incredibly slow (+2hrs for a couple thousands entities per day), and I believe this is because each entity and it's children are being written to the database one-by-one rather than being batched. Anyone know where I should start to address this? I'm seeing this with both target-redshift and target-duckdb (not as slow as Redshift).

Reuben (Matatika)

04/10/2025, 3:19 AM

Is there a recommended approach to handling API endpoints that support start and end dates in chunks (i.e. sliding window) and emitting/finalising state for each? The API we are dealing with supports exporting data in this way - and there is quite a lot of it - so dividing up into multiple requests and keeping track of the start date in state would be ideal. I implemented it mostly in a custom paginator class, but was not able to find how/where to apply state operations after each "page" (without overriding

request_records

entirely). Thought about setting up a date range parent stream to pass dates through as context, which I think would solve the state update part of my problem, but it felt counter-intuitive to how parent-child streams should work and incremental replication in general (I would end up with a bunch of interim dates stored in state as context, as well as the final date as

replication_key_value

)

👀 1

visch

04/10/2025, 8:21 PM

Parent and Child Stream contexts are shared between any Child Streams that have a shared Parent. Noticed an SDK thing today that I don't think used to be the case but this is a weird one. Would have to put together a tap to show this 🧵

✅ 1

Stéphane Burwash

04/15/2025, 2:06 PM

Hello everyone 👋 happy tuesday! I was wondering if you could give me some guidance on testing taps. I would love to use the integrated

test

functions, but I'd like to understand a bit more how they work (source https://sdk.meltano.com/en/latest/testing.html#singer_sdk.testing.get_tap_test_class) What do the tests actually DO? My goal would be that it only tests that the tap CAN run, and not that it tries to run to completion. Most of my taps are full_table, so running their tests would take WAY too long. Thanks 😄

✅ 1

hawkar_mahmod

05/08/2025, 9:16 AM

Hey everyone 👋 I’m working on a Customer.io tap using the Meltano Singer SDK and running into some odd parent-child behavior. Here’s the gist: - Parent stream (

SegmentsStream

) correctly fetches all segments and my override of

get_child_context(record)

returns

{"segment_id": record["id"]}

for the one segment I’m targeting. - Child stream (

SegmentMembersStream

) has

schema

including

"segment_id"

, no

replication_key

, and overrides

parse_response(response)

to yield only the member fields:

Copy code

python
    def parse_response(self, response):
        for identifier in response.json().get("identifiers", []):
            yield {
                "member_id": identifier["id"],
                "cio_id":    identifier.get("cio_id"),
                "email":     identifier.get("email"),
            }

- According to the docs, the SDK should automatically merge in the

segment_id

from context after

parse_response

(and before shipping the record out), as long as it’s in the schema. But in practice I only see

segment_id

in the separate

context

argument — it never appears in the actual record unless I manually inject it in `post_process`:

Copy code

python
    def post_process(self, row, context):
        row["segment_id"] = context["segment_id"]
        return row

Has anyone else seen this? Should the SDK be automatically adding parent-context fields into the record dict before emit, or is manual injection (in

post_process

) the expected approach here? Any pointers or workaround suggestions are much appreciated! 🙏

Siddu Hussain

05/10/2025, 1:59 AM

Hi @Edgar Ramírez (Arch.dev), do we have a plan to suppress this message when we have additionalProperties set to true? This method was designed to validate date types, but the error message about the missing key in the schema is being spammed for cases where additionalProperties is set to true. The fix seems quite straightforward: adding a condition to check if "additionalProperties" is true, then suppressing the warning "No schema for record field." Please let me know if this message serves a different purpose that might conflict with skipping it by checking "additionalProperties." https://github.com/meltano/sdk/blob/fdeb393416f0d1935e40c28b91c800c9d1b40822/singer_sdk/sinks/core.py#L588 fix : https://github.com/SidduHussain/sdk/blob/f99ecab90770932829f138aece164ad7d4196115/singer_sdk/sinks/core.py#L587

Nico Deunk

05/23/2025, 7:05 AM

I’m working on a tap that has a few ‘synchronization’ endpoints. The intended flow from the API developers is: • Request id’s and versions (with GET) • Compare them against current, stored versions in the target • Request new or updated id’s (multiple, max 100) (with POST - same endpoint) The maximum number of records in the biggest endpoint is about 2.500, so not really much. I could do a full load each time (if that’s easier). How could I handle this case? Example endpoint: https://developer.moneybird.com/api/documents_general_journal_documents/#get_documents_general_journal_documents_synchronization

Rafał

05/27/2025, 8:35 AM

Hi Which component is responsible for clearing batch files after being loaded by target? I didn't find anything in meltano code (after a quick skim) and SDK documentation doesn't mention it

mark_estey

06/02/2025, 9:06 PM

I've got a question about how to implement parent-child streams where the child streams only should run on specific records. In my example I'm syncing a parent "objects" stream (all the object records have the same schema) but there is a "object_type" and I need each child stream to only process the parent records that match a specific object type while ignoring all others. So say there are child streams "foo_object" and "bar_object", and each need the primary key of the parent "object" stream but foo_object only should process parent records where "object_type" = "foo" and ignore all other parent records (otherwise the endpoint will error if a bar_object is requested), and bar_object similarly only processes parent records where "object_type" = "bar". What's the best way for the child streams to only run on the matching the parent records and ignore the rest?

👀 2

Andy Carter

06/04/2025, 10:34 AM

Is it possible to have a child stream with two parents? Say I have a

Buildings

stream and a

Tenants

stream, and a

ServiceRequests

stream that requires a

building_id

AND`tenant_id` in the body of a post request (there is no GET/ list ServiceRequests endpoint). So I would iterate

Buildings

, then

Tenants

then iterate

ServiceRequests

with a full outer join of the two parents. Is that possible? https://www.linen.dev/s/meltano/t/16381950/hi-all-i-have-built-a-custom-tap-to-extract-data-from-an-api looks like a possible approach, basically I make my

Tenants

stream an artificial child of

Buildings

and then make

service requests

have

Tenants

as parent.

Rafał

06/08/2025, 8:28 AM

What's the expected result of running a tap with --discover and --catalog? Is the data from both sources supposed to be merged? https://sdk.meltano.com/en/latest/cli_commands.html#run-in-discovery-mode-with-a-passed-catalog-file

Tanner Wilcox

06/10/2025, 9:04 PM

If I want my tap to drop the whole table every time and start fresh how do I configure that? Chatgpt is telling me to add

replication_method = "FULL_TABLE"

to my stream class but that doesn't seem right

Mindaugas Nižauskas

06/30/2025, 5:06 AM

Hi, the default mysql extractor extractor is from wise, but as far as I understand, we cannot modify it easily, because pipelinewise now maintains a single repository of all taps. On the other hand, we have https://github.com/MeltanoLabs/tap-mysql, but how incremental or log-based replication is configured or modified? Or is it still needs to be developed?

Florian Bergmann

07/07/2025, 8:57 AM

Hi, I am working on a custom version of

tap-oracle

, based on

s7clarke10

pipelinewise

to cover some special cases of our source db. During tests I noticed that the replication method

log_based

uses continuous mine, which is deprecated since Oracle 12c and desupported since Oracle 19c, so six years ago. - Anyone knows an alternative I could use for logmining functionality? - trigger-based instead of log-based is currently no option for us.

Bruno Arnabar

08/07/2025, 9:21 PM

Is this command supposed to work if I am pulling in my tap through a pip_url from github that hasnt been published to meltano hub yet?

Copy code

meltano select tap-canvas --list --all

the tap supports discovery and catalog features

Matthew Wiseman

08/11/2025, 7:32 AM

Curious to know how anyone here would approach this problem. I'm writing a tap that queries a paginated API containing CSV files, with the intention to use the default Postgres target. I then read each file and emit each row. Each CSV file has 1 of 4 different schemas which I've declared with the Meltano Singer SDK - I want each of these file schemas to be separate tables in Postgres. I've declared each file type as a separate stream. I essentially want the main stream to 'fan out' to the individual file streams and they can choose to ignore the file if it matches the type. How have others approached this?

Don Venardos

08/20/2025, 6:28 PM

I am in the process of modifying a copy of tap-mssql --variant singer-io to integrate better with Meltano. I got the discovery option working correctly with Meltano. This issue was reported here: https://gitlab.com/meltano/meltano/-/issues/1175#note_684743065 Currently trying to get the error stream to flow through properly in the meltano run. What behavior does Meltano expect when an error is reported? Note: the error is actually due to how the DB schema is queried which I am fixing as well, but I want to make sure that errors are handled properly while I have one to work with.

meltano run tap-mssql target-jsonl

The tap correctly reports an error:

Copy code

2025-08-20T18:11:00.567018Z [info     ] FATAL [main] tap-mssql.core - Fatal Error Occured - Stream rss_test_dbo_c_logical_field_user_values has unsupported primary key(s): logical_field_sid cmd_type=elb consumer=False job_name=dev:tap-mssql-to-target-jsonl name=tap-mssql producer=True run_id=c8e6cbbf-26d4-415c-83d0-c420ccbf706c stdio=stderr string_id=tap-mssql
2025-08-20T18:11:00.567219Z [info     ] ERROR [main] #error {          cmd_type=elb consumer=False job_name=dev:tap-mssql-to-target-jsonl name=tap-mssql producer=True run_id=c8e6cbbf-26d4-415c-83d0-c420ccbf706c stdio=stderr string_id=tap-mssql
2025-08-20T18:11:00.567439Z [info     ]  :cause Stream rss_test_dbo_c_logical_field_user_values has unsupported primary key(s): logical_field_sid cmd_type=elb consumer=False job_name=dev:tap-mssql-to-target-jsonl name=tap-mssql producer=True run_id=c8e6cbbf-26d4-415c-83d0-c420ccbf706c stdio=stderr string_id=tap-mssql

But then Meltano gets a stack dump.

Copy code

2025-08-20T18:11:00.905173Z [error    ] Extractor failed              
2025-08-20T18:11:00.905403Z [error    ] Block run completed            block_type=ExtractLoadBlocks duration_seconds=89.355 err=RunnerError('Extractor failed') exit_codes={<PluginType.EXTRACTORS: 'extractors'>: 1} run_id=c8e6cbbf-26d4-415c-83d0-c420ccbf706c set_number=0 success=False
2025-08-20T18:11:00.906956Z [info     ] Run completed                  duration_seconds=89.357 run_id=c8e6cbbf-26d4-415c-83d0-c420ccbf706c status=failure
2025-08-20T18:11:00.907373Z [error    ] Need help fixing this problem? Visit <http://melta.no/> for troubleshooting steps, or to join our friendly Slack community.

Run invocation could not be completed as block failed: Extractor failed
╭─────────────────────────────── Traceback (most recent call last) ...

Tanner Wilcox

08/21/2025, 5:44 PM

I wrote a tap that gets info from our network devices using snmp. SNMP has no record of last time updated on these fields so I can't do replication. Currenly I throw away the whole schema before running the tap and get a fresh batch of data each time. We would like to change that and track historical data so that we know if two interfaces that had an OSPF adjacency changed their state. I'm looking for input on the correct way to do this. Do I need to put timestamps on every single record I store and just store everything for every day forever?

Adam Wegscheid

10/14/2025, 8:48 PM

Hello! Can anyone help me understand the difference between the

config_jsonschema

attribute for a

Tap

and the

settings

block in

meltano.yml

? At a high level, they seem to perform the same job. When I view the many extractors on Meltano hub, some have

config_jsonschema

filled to the brim but

settings

are relatively bare bones (no description, title, etc.) and then other taps are the exact opposite. I also find it odd that running

tap-example --about

doesn't print out nested properties of objects regardless of additional properties being set to true so I can't tell if they even matter. As you can see from this mess, my head is all over the place on this topic!

Andy Carter

10/15/2025, 7:22 AM

Are there any good examples of GraphQL taps written using the SDK? Confident with rest APIs but graph is new to me.