# help
w
Hello, after ramping up our development, we're now seeing a lot of problems with Atlas/MongoDB connection timeout failures. During our initial testing we saw no connection issues, but once we started ramping up traffic across several functions, we started seeing extended periods of connection failures, some as long as an hour, after which things would be fine for a while. One function in particular, the busiest of them, is plagued by these issues.

After looking at the Atlas logs, we noticed a LOT of authentication errors. These weren't really errors: Atlas currently only supports SCRAM-SHA-1, and recent versions of the mongodb npm driver first attempt SCRAM-SHA-256 and only then fall back to SHA-1. Thinking our issues might be related to this thrash, we explicitly set the auth mechanism in the connection string, which resolved the errors on the Atlas side, but we're still experiencing the connection timeouts.

We are far under the connection limit for our Atlas cluster, and we've scoured the security groups, network ACLs, etc. Our setup is a typical peered VPC with Atlas, and we followed the guide https://serverless-stack.com/examples/how-to-use-mongodb-atlas-in-your-serverless-app.html for setting up connection caching. I've found another post where they skip the VPC peering entirely and don't experience these issues. I almost feel this has to be something at the AWS networking level: we see nothing in the Atlas logs, and if it were an ACL, routing, or security group issue, it would fail consistently. I'm creating some CloudWatch metric graphs so we can see the behavior over time across all affected functions and look for correlations, but that's not going to solve the problem itself. Not sure what else to do to debug.
g
https://www.mongodb.com/blog/post/introducing-mongodb-atlas-data-api-now-available-preview
This is going to avoid a lot of problems. It's a similar model to DynamoDB: plain HTTPS requests instead of managed driver connections.
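A rough sketch of what a Data API insert looks like over plain HTTPS, assuming Node 18+ for the built-in fetch (use node-fetch on older runtimes). The app ID, key, cluster, and collection names below are placeholders, and the exact endpoint path depends on the API version:
```js
// Hypothetical Atlas Data API insert over HTTPS; no driver, no connection pool.
// DATA_API_URL and DATA_API_KEY are placeholders for your app's endpoint and key.
const DATA_API_URL =
  'https://data.mongodb-api.com/app/<app-id>/endpoint/data/v1/action/insertOne';

async function insertViaDataApi(doc) {
  const res = await fetch(DATA_API_URL, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'api-key': process.env.DATA_API_KEY, // Data API key from Atlas
    },
    body: JSON.stringify({
      dataSource: 'Cluster0',    // placeholder cluster name
      database: 'mydb',          // placeholder database name
      collection: 'sensordata',  // placeholder collection name
      document: doc,
    }),
  });
  if (!res.ok) {
    throw new Error(`Data API insert failed with status ${res.status}`);
  }
  return res.json(); // e.g. { insertedId: "..." }
}
```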
a
When you say connection timeouts, what happens exactly: does your Lambda time out, or are you unable to get a connection to the Atlas cluster at all, so the Lambda errors out?
Also, could you share the connection setup code with your config options and an example of how you're using it?
w
The connection times out, resulting in an unhandled promise rejection error:
```
2022-02-03T21:50:20.763Z	8c855639-1beb-473a-870e-bbb2819c9fb5	ERROR	Unhandled Promise Rejection 	{
    "errorType": "Runtime.UnhandledPromiseRejection",
    "errorMessage": "MongoServerSelectionError: connection timed out",
    "reason": {
        "errorType": "MongoServerSelectionError",
        "errorMessage": "connection timed out",
        "name": "MongoServerSelectionError",
        "reason": {
            "type": "ReplicaSetNoPrimary",
            "setName": "atlas-ug2v9v-shard-0",
            "maxSetVersion": 1,
            "maxElectionId": "7fffffff0000000000000012",
            "servers": {},
            "stale": false,
            "compatible": true,
            "compatibilityError": null,
            "logicalSessionTimeoutMinutes": null,
            "heartbeatFrequencyMS": 10000,
            "localThresholdMS": 15,
            "commonWireVersion": 9
        },
        "stack": [
            "MongoServerSelectionError: connection timed out",
            "    at Timeout._onTimeout (/var/task/src/kinesis.js:18862:34)",
            "    at listOnTimeout (internal/timers.js:557:17)",
            "    at processTimers (internal/timers.js:500:7)"
        ]
    },
    "promise": {},
    "stack": [
        "Runtime.UnhandledPromiseRejection: MongoServerSelectionError: connection timed out",
        "    at process.<anonymous> (/var/runtime/index.js:35:15)",
        "    at process.emit (events.js:400:28)",
        "    at processPromiseRejections (internal/process/promises.js:245:33)",
        "    at processTicksAndRejections (internal/process/task_queues.js:96:32)"
    ]
}
```
The function's connection setup:
```js
import { MongoClient } from 'mongodb';

// Once we connect to the database once, we'll store that connection
// and reuse it so that we don't have to connect to the database on every request.
let cachedDb = null;

async function connectToDatabase() {
  console.log('connectToDatabase');
  if (cachedDb) {
    return cachedDb;
  }

  const connectionURL = process.env.MONGO_URL;
  // console.log(
  //   `the env connection url: ${process.env.MONGO_URL} and the variable from same: ${connectionURL}`
  // );
  const client = await MongoClient.connect(connectionURL, {
    useUnifiedTopology: true,
  });

  // Note: client.db() is synchronous, so no await is needed here.
  cachedDb = client.db(process.env.MONGO_DB);

  console.log(`returning cachedDb: ${cachedDb}`);
  return cachedDb;
}

export async function main(event, context) {
  // console.log(`the kinesis event: ${JSON.stringify(event)}`);

  // Per Atlas docs, this isn't appropriate since we're in an async handler
  // without a callback arg; tried it both ways, no difference in error
  // frequency.
  // context.callbackWaitsForEmptyEventLoop = false;

  const db = await connectToDatabase();
  console.log(`The db: ${db}`);
  // Build one insert per Kinesis record and await them all, so that a
  // failed insert fails this invocation instead of surfacing later as an
  // unhandled promise rejection.
  await Promise.all(
    event.Records.map((item) => {
      const payload = Buffer.from(item.kinesis.data, 'base64').toString('utf-8');
      const record = JSON.parse(payload);
      // console.log(
      //   `the record decoded, to be inserted: ${JSON.stringify(record)}`
      // );

      const insertRecord = {
        customer_id: record.customer_id,
        device_id: record.item_id,
        sensorid: record.sensorid,
        state: record.state,
        voltage: record.voltage,
        type: record.type,
        signal: record.signal,
        data: record.deviceData,
        customer_code: record.customer_code,
        sensortimestamp: new Date(record.sensortimestamp),
      };
      // console.log(`inserting ${JSON.stringify(insertRecord)}`);
      return db.collection(process.env.MONGO_COLLECTION).insertOne(insertRecord);
    })
  );

  return {
    statusCode: 200,
    body: 'ok',
  };
}
```
And the connection URL is modified to include &authMechanism=SCRAM-SHA-1, per Atlas's recommendation, as that avoids the driver's second attempt after it initially tries SCRAM-SHA-256, because as of now they only support SHA-1. Thanks!
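For reference, a minimal sketch of the resulting connection string; the user, password, and cluster hostname are placeholders:
```js
// Hypothetical connection string; forcing authMechanism=SCRAM-SHA-1 skips
// the driver's initial SCRAM-SHA-256 attempt against Atlas.
const MONGO_URL =
  'mongodb+srv://<user>:<password>@cluster0.example.mongodb.net/mydb' +
  '?retryWrites=true&w=majority&authMechanism=SCRAM-SHA-1';
```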
@Gabriel Araújo I think refactoring to this is our most likely workaround. Thank you!
a
What's the cluster type? Is it a shared cluster or a dedicated one?
w
It's an M10-class replica set with three nodes; not sharded.
We're refactoring to bail out of native driver access for now, using the preview of Atlas's API services. If that's not stable enough, our next plan is to create a dedicated service in k8s that abstracts all the Atlas calls, removing Lambda entirely from that process (roughly like the sketch below).
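A minimal sketch only, not our actual service; the route, port, and env names are made up. The point is that a long-lived process keeps one MongoClient pool open instead of reconnecting per invocation:
```js
// Sketch of a long-lived insert service in front of Atlas (e.g. run in k8s).
import express from 'express';
import { MongoClient } from 'mongodb';

const client = new MongoClient(process.env.MONGO_URL);
await client.connect(); // one connection pool for the life of the process
const db = client.db(process.env.MONGO_DB);

const app = express();
app.use(express.json());

// Hypothetical route; callers POST the document they want stored.
app.post('/records', async (req, res) => {
  try {
    const result = await db
      .collection(process.env.MONGO_COLLECTION)
      .insertOne(req.body);
    res.status(201).json({ insertedId: result.insertedId });
  } catch (err) {
    res.status(500).json({ error: err.message });
  }
});

app.listen(3000);
```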
g
Could you post an update here once it's done? I'm curious about the Atlas Data API in production.
w
@Gabriel Araújo so far, it's great! We have all functions converted to using it, and response times are under a tenth of a second. No errors. I'll post back if anything horrible comes up, but so far, so good.
g
Thanks!! I'll try it here soon.
b
Following up on my partner's (@William Hatch) post. After switching to the Atlas Data API, our connections are working nicely. No issues whatsoever: 100% connectivity for 4 days now, with good response times. For our project it was not a big refactor to make the switch.
a
Yep, makes sense. HTTPS requests are short-lived and use standard ports, so you avoid managing long-lived driver connections across the peered network.