# kotlin-spark
  • davidh

    02/23/2022, 3:52 PM
    Hi, is there an alternate Maven repo that has the Spark 3.2 versions? Maven Central only has 1.0.2, which doesn't support the new 3.2.
  • Michael Livshitz

    04/13/2022, 12:05 PM
    Hi! Jumped over here just to say that I've recently finished migrating a large set of complex Lucene-Over-Spark pipelines from Scala to Kotlin and it was a blast!
  • Jolan Rensen [JetBrains]

    05/26/2022, 4:55 PM
    https://blog.jetbrains.com/big-data-tools/2022/05/26/kotlin-api-for-apache-spark-streaming-jupyter-and-more/
  • Jolan Rensen [JetBrains]

    05/31/2022, 3:33 PM
    Hi everyone, we at JetBrains are curious to find out what you use the Kotlin Spark API for. Do you have any problems with it? What would you like to see in upcoming versions? Don't hesitate to let us know 🙂
  • jaeyol.youn

    06/29/2022, 12:43 AM
    Hello, I need your help. Sample stack trace:
    Copy code
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 4
    22/06/29 01:58:58 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on lniuhwk3388.nhnjp.ism:42755 in memory (size: 48.1 KB, free: 366.3 MB)
    22/06/29 01:58:58 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on lniuhwk4040.nhnjp.ism:36704 in memory (size: 48.1 KB, free: 1458.6 MB)
    22/06/29 01:58:58 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on lniuhwk3393.nhnjp.ism:33023 in memory (size: 48.1 KB, free: 1458.6 MB)
    22/06/29 01:58:58 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on lniuhwk3870.nhnjp.ism:34781 in memory (size: 48.1 KB, free: 1458.6 MB)
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 23
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 15
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 5
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 1
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 6
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 21
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 17
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 10
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 8
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 7
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 3
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 13
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 2
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 9
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 22
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 12
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 0
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 24
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 19
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 16
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 14
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 18
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 11
    22/06/29 01:58:58 INFO spark.ContextCleaner: Cleaned accumulator 20
    22/06/29 01:59:00 ERROR yarn.ApplicationMaster: User class threw exception: kotlin.jvm.KotlinReflectionNotSupportedError: Kotlin reflection implementation is not found at runtime. Make sure you have kotlin-reflect.jar in the classpath
    kotlin.jvm.KotlinReflectionNotSupportedError: Kotlin reflection implementation is not found at runtime. Make sure you have kotlin-reflect.jar in the classpath
    	at kotlin.jvm.internal.ClassReference.error(ClassReference.kt:84)
    	at kotlin.jvm.internal.ClassReference.isData(ClassReference.kt:70)
    	at org.jetbrains.kotlinx.spark.api.ApiV1Kt.isSupportedClass(ApiV1.kt:172)
    	at org.jetbrains.kotlinx.spark.api.ApiV1Kt.generateEncoder(ApiV1.kt:167)
    	at com.zshp.agg.batch.bannedword.CatalogBannedWordSparkApplication.main(CatalogBannedWordSparkApplication.kt:159)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
    22/06/29 01:59:00 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: kotlin.jvm.KotlinReflectionNotSupportedError: Kotlin reflection implementation is not found at runtime. Make sure you have kotlin-reflect.jar in the classpath
    	at kotlin.jvm.internal.ClassReference.error(ClassReference.kt:84)
    	at kotlin.jvm.internal.ClassReference.isData(ClassReference.kt:70)
    	at org.jetbrains.kotlinx.spark.api.ApiV1Kt.isSupportedClass(ApiV1.kt:172)
    	at org.jetbrains.kotlinx.spark.api.ApiV1Kt.generateEncoder(ApiV1.kt:167)
    	at com.zshp.agg.batch.bannedword.CatalogBannedWordSparkApplication.main(CatalogBannedWordSparkApplication.kt:159)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
    )
    22/06/29 01:59:00 INFO spark.SparkContext: Invoking stop() from shutdown hook
    22/06/29 01:59:00 INFO server.AbstractConnector: Stopped Spark@52631839{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
    22/06/29 01:59:00 INFO ui.SparkUI: Stopped Spark web UI at <http://lniuhwk3388.nhnjp.ism:38055>
    22/06/29 01:59:00 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
    22/06/29 01:59:00 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
    22/06/29 01:59:00 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
    22/06/29 01:59:00 INFO cluster.SchedulerExtensionServices: Stopping SchedulerExtensionServices
    (serviceOption=None,
     services=List(),
     started=false)
    22/06/29 01:59:00 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    22/06/29 01:59:00 INFO memory.MemoryStore: MemoryStore cleared
    22/06/29 01:59:00 INFO storage.BlockManager: BlockManager stopped
    22/06/29 01:59:00 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
    22/06/29 01:59:00 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    22/06/29 01:59:00 INFO spark.SparkContext: Successfully stopped SparkContext
    22/06/29 01:59:00 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: kotlin.jvm.KotlinReflectionNotSupportedError: Kotlin reflection implementation is not found at runtime. Make sure you have kotlin-reflect.jar in the classpath
    	at kotlin.jvm.internal.ClassReference.error(ClassReference.kt:84)
    	at kotlin.jvm.internal.ClassReference.isData(ClassReference.kt:70)
    	at org.jetbrains.kotlinx.spark.api.ApiV1Kt.isSupportedClass(ApiV1.kt:172)
    	at org.jetbrains.kotlinx.spark.api.ApiV1Kt.generateEncoder(ApiV1.kt:167)
    	at com.zshp.agg.batch.bannedword.CatalogBannedWordSparkApplication.main(CatalogBannedWordSparkApplication.kt:159)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
    )
    22/06/29 01:59:00 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
    22/06/29 01:59:00 INFO yarn.ApplicationMaster: Deleting staging directory <viewfs://iu/user/z_shopping_ai/.sparkStaging/application_1655162175779_1751556>
    22/06/29 01:59:00 INFO util.ShutdownHookManager: Shutdown hook called
  • jaeyol.youn

    06/29/2022, 12:45 AM
    Copy code
    # gradle version 6.8.2
    val kotlinVersion = "1.6.21"
     dependency("org.apache.spark:spark-sql_2.12:2.4.4")
     dependency("org.jetbrains.kotlinx.spark:kotlin-spark-api-2.4_2.12:1.0.2")
     // shadow jar <https://github.com/johnrengelman/shadow>
     dependency("com.github.johnrengelman.shadow:com.github.johnrengelman.shadow.gradle.plugin:6.1.0")
  • jaeyol.youn

    06/29/2022, 12:48 AM
    Copy code
    Spark Version: 2.4.4
    
    $SPARK_HOME/bin/spark-submit \
    --master yarn \
    --deploy-mode cluster
  • Jolan Rensen [JetBrains]

    06/29/2022, 7:20 PM
    Unfortunately, we stopped supporting Spark 2.x; it became too much to maintain. If you have the possibility, upgrade to Spark 3.0, 3.1, or 3.2 🙂 Those are the newest versions we support with release 1.1.0. We still have 1.0.2 up for Spark 2.x, but then you're on your own. Although, reading through the error, try adding kotlin-reflect as a dependency to your project; that might solve it temporarily.
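    A minimal sketch of that suggestion, assuming a Gradle Kotlin DSL build; the coordinates are the standard kotlin-reflect artifact, and the version matches the Kotlin version reported earlier in the thread:
    // build.gradle.kts (illustrative)
    dependencies {
        // Kotlin reflection runtime, so the API's data-class checks can use full reflection at runtime
        implementation("org.jetbrains.kotlin:kotlin-reflect:1.6.21")
    }
    Since the job is submitted with spark-submit in cluster mode, the dependency also has to end up in the fat jar rather than only on the local classpath.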
    πŸ‘ 1
    ❀️ 1
    πŸ™‡β€β™‚οΈ 1
  • Jolan Rensen [JetBrains]

    08/04/2022, 2:48 PM
    Kotlin Spark API v1.2 is out! We now support all combinations of Scala and Spark patch versions, and we've added support for UDTs, UDFs, and more. Read the blog post below if you want to learn more or try it out 🙂 https://blog.jetbrains.com/kotlin/2022/08/kotlin-api-for-apache-spark-v1-2-udts-udfs-rdds-compatibility-and-more/ (And of course, let us know if you have any issues, suggestions, thoughts, and whatnot 🙂)
  • Michael Livshitz

    08/29/2022, 8:33 PM
    Hi! I've tried using Kotlin Spark v1.2 today, specifically kotlin-spark-api_3.0.1_2.12:1.2.1 alongside Kotlin 1.6.10 with a 1.8 JVM target, on an existing code base that already uses an older version of Kotlin Spark. Once I replaced my old dependencies with the new ones, I got: Cannot inline bytecode built with JVM target 11 into bytecode that is being built with JVM target 1.8. Please specify proper '-jvm-target' option. When I go to the sources, I do see that it is bytecode 55 (Java 11). I was wondering if there is anything I can do about it, given that I'm stuck on a 1.8 JVM. Thank you for making Spark awesome, btw :D
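    For context, a sketch of the compiler settings involved, assuming a Gradle Kotlin DSL build: inline functions in a library compiled for JVM 11 (class-file version 55) cannot be inlined into code compiled for a lower target, so the consuming module would have to target 11 as well, which only helps if the runtime JVM can be upgraded:
    // build.gradle.kts (illustrative): raise the consumer's targets to match the library
    tasks.withType<org.jetbrains.kotlin.gradle.tasks.KotlinCompile> {
        kotlinOptions.jvmTarget = "11"
    }
    java {
        sourceCompatibility = JavaVersion.VERSION_11
        targetCompatibility = JavaVersion.VERSION_11
    }
    (As it turned out later in this channel, the practical answer for Java 8 users was the backport released as v1.2.3.)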
  • Denis Ambatenne

    10/06/2022, 10:39 AM
    Hi all! 👋 📣 The Kotlin team is trying to determine how to develop the Kotlin Spark API further, and we'd love your input. If you are actively using it in your personal or professional projects, please share your experience with us in a phone interview. It will take no more than one hour, and you'll get a 1-year All Products Pack subscription or a $100 Amazon Gift Card as a reward. 👉 Please choose a time slot in my Calendly to set up a call.
    πŸ‘ 3
    πŸ‘πŸΌ 1
  • Aleksander Eskilson

    11/15/2022, 3:54 PM
    Hi folks! I'd posted in the kotlin-spark-api gitter a few days ago: https://gitter.im/JetBrains/kotlin-spark-api?at=636d86ea18f21c023ba7b373 There I'd noted that the default Spark classloader can have conflicts with kotlin-spark-api's replacement spark-sql Catalyst converters. The solution I'd initially tried was to set the spark.driver.userClassPathFirst config option, but in practice this causes a lot of other classpath issues with other supplied jars (things like S3 impl jars). Another approach I tried was to make a shaded jar of my project, ensuring I included shaded dependencies for org.jetbrains.kotlinx.spark and org.jetbrains.kotlin. This seems to work when I add my project with the spark.jars and also the spark.driver.extraClassPath option (which states it'll prepend jars to the Spark classpath, which I suppose is the reason the replacement Catalyst converters get properly picked up by Spark first), pointing to my project's physical jar. That's all a little unwieldy, though. I'd prefer to use the spark.jars.packages option for my packaged project, but that doesn't seem to work (the spark-sql overloads seem not to get discovered by the Spark classloader before its own implementation). Is anyone else experiencing odd classloader issues between Spark and the kotlin-spark-api replacement of Catalyst converters? Ideas on how to resolve these classpath conflicts?
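    For reference, the working setup described above expressed as plain Spark properties (a sketch; the jar path is illustrative):
    // the two options that let the shaded jar's converters be found before Spark's own classes
    val conf = org.apache.spark.SparkConf()
        .set("spark.jars", "/path/to/my-shaded-project.jar")
        .set("spark.driver.extraClassPath", "/path/to/my-shaded-project.jar")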
  • Jolan Rensen [JB]

    12/02/2022, 1:12 PM
    Kotlin Spark API v1.2.2 is released! https://github.com/Kotlin/kotlin-spark-api/releases/tag/1.2.2 It's just a small update this time, mostly version bumps:
    • Added BigInteger support in #182 thanks to #181
    • New Spark versions: 3.2.3, 3.3.1, 3.2.2
    • New Scala versions: 2.12.17, 2.13.10
    • Updated Kotlin to 1.7.20
    • Small bugfix regarding Map encoding
    You can get the version that works with your Spark/Scala setup using the following table: https://github.com/Kotlin/kotlin-spark-api#supported-versions-of-apache-spark (Might take a couple of hours for Maven Central to update)
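    As an illustration of how that table maps to dependency coordinates (the Spark/Scala pair below is only an example; pick the one matching your cluster):
    dependencies {
        // artifact pattern: kotlin-spark-api_[spark version]_[scala version]:[api version]
        implementation("org.jetbrains.kotlinx.spark:kotlin-spark-api_3.3.1_2.13:1.2.2")
    }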
  • holgerbrandl

    12/12/2022, 1:43 PM
    Hi, when working through https://blog.jetbrains.com/kotlin/2020/08/introducing-kotlin-for-apache-spark-preview/ I've noticed that the IDE no longer seems able to destructure the Tuple2 in
    Copy code
    citiesWithCoordinates.rightJoin(populations, citiesWithCoordinates.col("name") `===` populations.col("city"))
                .filter { (_, citiesPopulation) ->
                    citiesPopulation.population > 15_000_000L
                }
    Was this never working, or was destructuring support for Tuple2 dropped?
  • holgerbrandl

    12/12/2022, 1:43 PM
    image.png
  • Jolan Rensen [JB]

    12/12/2022, 1:46 PM
    @holgerbrandl Destructuring of tuples is definitely part of the API. Make sure to import org.jetbrains.kotlinx.spark.api.tuples.* so it's in scope 🙂 (the IDE doesn't always auto-import stuff like this correctly)
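    A minimal sketch of what that import enables (the function and values are made up for illustration): the tuples package provides componentN() operators for Scala tuples, which is what makes destructuring compile.
    import org.jetbrains.kotlinx.spark.api.tuples.*  // component1()/component2() etc. for Scala tuples
    import scala.Tuple2

    fun describe(cityAndPopulation: Tuple2<String, Long>): String {
        // destructuring resolves to the componentN() extension operators from the import above
        val (city, population) = cityAndPopulation
        return "$city has $population inhabitants"
    }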
  • Jolan Rensen [JB]

    12/12/2022, 1:54 PM
    I do see that you're referring to the blog post from over 2 years ago. There have been a lot of changes since then (and that's an understatement). I'd recommend also checking out the readme (https://github.com/Kotlin/kotlin-spark-api), the examples (https://github.com/Kotlin/kotlin-spark-api/tree/release/examples/src/main/kotlin/org/jetbrains/kotlinx/spark/examples), and the extensive wiki (https://github.com/Kotlin/kotlin-spark-api/wiki) 🙂 Lots of up-to-date stuff there.
  • holgerbrandl

    12/14/2022, 9:44 AM
    Hi, when trying to run a small example defined in https://github.com/holgerbrandl/kalasim/blob/master/modules/sparksim/src/main/kotlin/org/kalasim/exaples/spark/DistributedSim.kt#L26 it fails with
    Copy code
    Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (192.168.217.128 executor 0): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD
    ... very long stacktrace  ....
    I was under the impression that serialization should be very simple in this expression, and every job just returns a simple string. What am I doing wrong?
  • Jolan Rensen [JB]

    01/09/2023, 12:48 PM
    Kotlin Spark API v1.2.3 is released! https://github.com/Kotlin/kotlin-spark-api/releases/tag/1.2.3 Due to high demand, we backported the project to support Java 8 for all modules (except Jupyter). You can get any of the supported versions https://github.com/Kotlin/kotlin-spark-api#supported-versions-of-apache-spark from Maven Central as usual 🙂 Let us know if you have any issues with it!
  • Jolan Rensen [JB]

    02/27/2023, 1:45 PM
    Thanks to Awesome Spark for including us on their list 🙂
    πŸ‘ 2
  • Jolan Rensen [JB]

    07/26/2023, 11:41 AM
    Kotlin Spark API v1.2.4 is released: https://github.com/Kotlin/kotlin-spark-api/releases/tag/v1.2.4 It now includes Java 8 support for Jupyter Notebooks as well, plus Spark support from 3.0.0 to 3.3.2. Get the version you need from Maven Central using this table, as usual: https://github.com/Kotlin/kotlin-spark-api#supported-versions-of-apache-spark If you're using Jupyter, the new version will be used automatically, or you can specify it as described here: https://github.com/Kotlin/kotlin-spark-api/wiki/Jupyter Regarding Spark 3.4+, we're in a bit of a pickle. Due to changes in Spark, the API can no longer simply be updated to support that version. It will cost a significant amount of work to rebuild the base of the Kotlin Spark API from the ground up, and we're investigating whether there's enough demand for it. If you have any ideas on how to add support for Kotlin data classes and the inferring encoder<>() function, or you simply want to show your interest in the matter, let us know in the relevant issue: https://github.com/Kotlin/kotlin-spark-api/issues/195 As usual, let us know if you have any issues, and enjoy the version bump :)
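    For context, a minimal sketch of the data-class support being discussed (the Person class is made up; withSpark, dsOf, map, and encoder<>() are the API entry points named in this channel):
    import org.jetbrains.kotlinx.spark.api.*

    data class Person(val name: String, val age: Int)

    fun main() = withSpark(master = "local") {
        // encoder<>() infers an Encoder for a Kotlin data class from the reified type parameter
        println(encoder<Person>().schema())
        dsOf(Person("Jane", 32), Person("Ali", 41))
            .map { it.name }   // map from the Kotlin API infers an encoder for the String result
            .show()
    }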
    πŸ‘ 2
    πŸŽ‰ 4
  • Tyler Kinkade

    03/05/2024, 3:26 AM
    @Jolan Rensen [JB] Hi 🙂 I'm trying to use the Kotlin DataFrame interoperability code from the project wiki to convert a Kotlin DataFrame to a Spark DataFrame in a Kotlin notebook, but it doesn't compile (error below).
    Copy code
    %use dataframe
    %use spark
    import org.apache.spark.api.java.JavaSparkContext
    import org.apache.spark.sql.*
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.*
    import org.apache.spark.unsafe.types.CalendarInterval
    import org.jetbrains.kotlinx.dataframe.*
    import org.jetbrains.kotlinx.dataframe.annotations.DataSchema
    import org.jetbrains.kotlinx.dataframe.api.*
    import org.jetbrains.kotlinx.dataframe.columns.ColumnKind.*
    import org.jetbrains.kotlinx.dataframe.schema.*
    import org.jetbrains.kotlinx.spark.api.*
    import java.math.*
    import java.sql.*
    import java.time.*
    import kotlin.reflect.*
    
    fun DataFrame<*>.toSpark(spark: SparkSession, sc: JavaSparkContext): Dataset<Row> {
        val rows = sc.toRDD(rows().map(DataRow<*>::toSpark))
        return spark.createDataFrame(rows, schema().toSpark())
    }
    
    fun DataRow<*>.toSpark(): Row =
        RowFactory.create(
            *values().map {
                when (it) {
                    is DataRow<*> -> it.toSpark()
                    else -> it
                }
            }.toTypedArray()
        )
    
    fun DataFrameSchema.toSpark(): StructType =
        DataTypes.createStructType(
            columns.map { (name, schema) ->
                DataTypes.createStructField(name, schema.toSpark(), schema.nullable)
            }
        )
    
    fun ColumnSchema.toSpark(): DataType =
        when (this) {
            is ColumnSchema.Value -> type.toSpark() ?: error("unknown data type: $type")
            is ColumnSchema.Group -> schema.toSpark()
            is ColumnSchema.Frame -> error("nested dataframes are not supported")
            else -> error("unknown column kind: $this")
        }
    
    fun KType.toSpark(): DataType? = when(this) {
        typeOf<Byte>(), typeOf<Byte?>() -> DataTypes.ByteType
        typeOf<Short>(), typeOf<Short?>() -> DataTypes.ShortType
        typeOf<Int>(), typeOf<Int?>() -> DataTypes.IntegerType
        typeOf<Long>(), typeOf<Long?>() -> DataTypes.LongType
        typeOf<Boolean>(), typeOf<Boolean?>() -> DataTypes.BooleanType
        typeOf<Float>(), typeOf<Float?>() -> DataTypes.FloatType
        typeOf<Double>(), typeOf<Double?>() -> DataTypes.DoubleType
        typeOf<String>(), typeOf<String?>() -> DataTypes.StringType
        typeOf<LocalDate>(), typeOf<LocalDate?>() -> DataTypes.DateType
        typeOf<Date>(), typeOf<Date?>() -> DataTypes.DateType
        typeOf<Timestamp>(), typeOf<Timestamp?>() -> DataTypes.TimestampType
        typeOf<Instant>(), typeOf<Instant?>() -> DataTypes.TimestampType
        typeOf<ByteArray>(), typeOf<ByteArray?>() -> DataTypes.BinaryType
        typeOf<Decimal>(), typeOf<Decimal?>() -> DecimalType.SYSTEM_DEFAULT()
        typeOf<BigDecimal>(), typeOf<BigDecimal?>() -> DecimalType.SYSTEM_DEFAULT()
        typeOf<BigInteger>(), typeOf<BigInteger?>() -> DecimalType.SYSTEM_DEFAULT()
        typeOf<CalendarInterval>(), typeOf<CalendarInterval?>() -> DataTypes.CalendarIntervalType
        else -> null
    }
    Error:
    Copy code
    Line_31.jupyter.kts (20:48 - 55) 'toSpark' is a member and an extension at the same time. References to such elements are not allowed
    How can I fix this?
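    A possible explanation and workaround (a sketch, not a confirmed fix): in a Kotlin notebook, top-level declarations are compiled as members of the snippet class, so an extension function such as DataRow<*>.toSpark() becomes a member and an extension at the same time, and a callable reference to it is rejected. Calling it through a lambda avoids the reference:
    // replace the callable reference with a lambda in the notebook cell
    fun DataFrame<*>.toSpark(spark: SparkSession, sc: JavaSparkContext): Dataset<Row> {
        val rows = sc.toRDD(rows().map { it.toSpark() })  // was: rows().map(DataRow<*>::toSpark)
        return spark.createDataFrame(rows, schema().toSpark())
    }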
  • Eric Rodriguez

    03/12/2024, 4:32 PM
    👋 Hello, team! I'm using kotlin-spark-api_3.3.2_2.12:1.2.4 to read and write some parquet files from MinIO through hadoop-aws and aws-java-sdk. I have a standalone cluster using docker compose, and I have a Jupyter notebook on the same network as well. Everything works fine if I launch it from Jupyter or from IntelliJ, but when I create a fat jar with maven-shade-plugin and use spark-submit, it just stalls forever:
    Copy code
    24/03/12 12:09:26 INFO StandaloneSchedulerBackend: Granted executor ID app-20240312110926-0002/1 on hostPort 172.20.0.4:42795 with 1 core(s), 1024.0 MiB RAM
    24/03/12 12:09:26 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.55:58381 with 434.4 MiB RAM, BlockManagerId(driver, 192.168.1.55, 58381, None)
    24/03/12 12:09:26 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.55, 58381, None)
    24/03/12 12:09:26 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.55, 58381, None)
    24/03/12 12:09:26 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
  • Eric Rodriguez

    03/13/2024, 11:31 AM
    Does anyone know of a Spark 3.3.2 JDK 11 image that you can recommend?
  • Eric Rodriguez

    03/13/2024, 9:41 PM
    Why is spark.sql.codegen.wholeStage set to false for Jupyter?
  • Eric Rodriguez

    03/20/2024, 6:22 AM
    So I think the issue is that my Spark images have Hadoop 3.3.2 but the kotlin-spark-api uses 3.3.6, just FYI:
    Copy code
    +- org.jetbrains.kotlinx.spark:kotlin-spark-api_3.3.2_2.12:jar:1.2.4:compile
    [INFO] |  +- org.jetbrains.kotlinx.spark:core_3.3.2_2.12:jar:1.2.4:compile
    [INFO] |  +- org.jetbrains.kotlinx.spark:scala-tuples-in-kotlin_2.12:jar:1.2.4:compile
    [INFO] |  +- org.jetbrains.kotlin:kotlin-stdlib-jdk8:jar:1.8.20:compile
    [INFO] |  |  \- org.jetbrains.kotlin:kotlin-stdlib-jdk7:jar:1.8.20:compile
    [INFO] |  +- org.apache.spark:spark-streaming_2.12:jar:3.3.2:runtime
    [INFO] |  \- org.apache.hadoop:hadoop-client:jar:3.3.6:runtime
    [INFO] |     +- org.apache.hadoop:hadoop-common:jar:3.3.6:runtime
    [INFO] |     |  +- org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_7:jar:1.1.1:runtime
    [INFO] |     |  +- org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.1.1:runtime
    [INFO] |     |  +- commons-net:commons-net:jar:3.9.0:runtime
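    If the mismatch is indeed the problem, one way to test it is to align the Hadoop client with the version the Spark images ship. The build in this thread is Maven (an <exclusion> plus an explicit hadoop-client dependency); the same idea as a Gradle Kotlin DSL sketch:
    dependencies {
        implementation("org.jetbrains.kotlinx.spark:kotlin-spark-api_3.3.2_2.12:1.2.4") {
            // drop the transitively pulled 3.3.6 client
            exclude(group = "org.apache.hadoop", module = "hadoop-client")
        }
        // pin the version that matches the cluster images
        implementation("org.apache.hadoop:hadoop-client:3.3.2")
    }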
  • Eric Rodriguez

    03/25/2024, 8:38 AM
    So I downgraded to JDK 11 and made sure hive 3.2.2 is the only version in the deps, and everything works as expected, both in local[*] and against a cluster on docker-compose. But then I began to save in delta format instead of parquet, and now I get
    Copy code
    cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF``
    when I run against the cluster (it works in local). Any idea how I may debug this?
  • Tyler Kinkade

    04/09/2024, 11:10 AM
    I can't figure out how to solve this exception: when withSpark is invoked dynamically in a Dockerized Ktor web app, an UnsupportedFileSystemException is thrown. The GitHub issue is here; the GitHub repo is here. Broadcast.kt (from the kotlin-spark-api examples):
    Copy code
    import org.jetbrains.kotlinx.spark.api.map
    import org.jetbrains.kotlinx.spark.api.withSpark
    import java.io.Serializable
    
    object Broadcast {
        data class SomeClass(val a: IntArray, val b: Int) : Serializable
    
        fun broadcast(): MutableList<Int> {
            lateinit var result: MutableList<Int>
            withSpark(master = "local") {
                val broadcastVariable = spark.broadcast(SomeClass(a = intArrayOf(5, 6), b = 3))
                result = listOf(1, 2, 3, 4, 5)
                    .toDS()
                    .map {
                        val receivedBroadcast = broadcastVariable.value
                        it + receivedBroadcast.a.first()
                    }
                    .collectAsList()
                println(result)
            }
            return result
        }
    }
    Routing.kt
    Copy code
    fun Application.configureRouting() {
        routing {
            get("/") {
                val list = Broadcast.broadcast()
                call.respondText(list.toString())
            }
        }
    }
    Dockerfile
    Copy code
    # syntax=docker/dockerfile:1
    FROM eclipse-temurin:11-jre-jammy AS jre-jammy-spark
    RUN curl https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3-scala2.13.tgz -o spark.tgz && \
        tar -xf spark.tgz && \
        mv spark-3.3.2-bin-hadoop3-scala2.13 /opt/spark && \
        rm spark.tgz
    ENV SPARK_HOME="/opt/spark"
    ENV PATH="${PATH}:/opt/spark/bin:/opt/spark/sbin"
    
    FROM gradle:8.4-jdk11 AS gradle-build
    COPY --chown=gradle:gradle . /home/gradle/src
    WORKDIR /home/gradle/src
    RUN gradle buildFatJar --no-daemon
    
    FROM jre-jammy-spark AS app
    RUN mkdir -p /app
    COPY --from=gradle-build /home/gradle/src/build/libs/*-all.jar /app/app.jar
    ENTRYPOINT ["java","-jar","/app/app.jar"]
    compose.yaml
    Copy code
    services:
      app:
        build: .
        ports:
          - 8888:8888
    In a shell, run:
    Copy code
    $ docker compose up
    Then, open http://localhost:8888 in a browser. An org.apache.hadoop.fs.UnsupportedFileSystemException will be thrown:
    Copy code
    app-1  | 2024-04-09 10:26:26.484 [main] INFO  ktor.application - Autoreload is disabled because the development mode is off.
    app-1  | 2024-04-09 10:26:26.720 [main] INFO  ktor.application - Application started in 0.261 seconds.
    app-1  | 2024-04-09 10:26:26.816 [DefaultDispatcher-worker-1] INFO  ktor.application - Responding at <http://0.0.0.0:8888>
    app-1  | WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
    app-1  | Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
    app-1  | 2024-04-09 10:27:07.885 [eventLoopGroupProxy-4-1] WARN  o.a.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    app-1  | WARNING: An illegal reflective access operation has occurred
    app-1  | WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/app/app.jar) to constructor java.nio.DirectByteBuffer(long,int)
    app-1  | WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
    app-1  | WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
    app-1  | WARNING: All illegal access operations will be denied in a future release
    app-1  | 2024-04-09 10:27:10.066 [eventLoopGroupProxy-4-1] WARN  o.a.spark.sql.internal.SharedState - URL.setURLStreamHandlerFactory failed to set FsUrlStreamHandlerFactory
    app-1  | 2024-04-09 10:27:10.071 [eventLoopGroupProxy-4-1] WARN  o.a.spark.sql.internal.SharedState - Cannot qualify the warehouse path, leaving it unqualified.
    app-1  | org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "file"
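    One line of investigation (a sketch, not a confirmed fix): the "No FileSystem for scheme: file" error often shows up when a fat jar merges the Hadoop artifacts and their META-INF/services registrations for org.apache.hadoop.fs.FileSystem overwrite each other. Naming the local implementation explicitly is one way to test that theory; fs.file.impl is standard Hadoop configuration, passed here through Spark's spark.hadoop.* prefix via the withSpark props map:
    import org.jetbrains.kotlinx.spark.api.withSpark

    fun broadcastWithExplicitFs() =
        withSpark(
            props = mapOf(
                // assumption: if the fat jar lost the FileSystem service entries,
                // pointing Hadoop at the implementation directly sidesteps the lookup
                "spark.hadoop.fs.file.impl" to "org.apache.hadoop.fs.LocalFileSystem",
            ),
            master = "local",
        ) {
            // ... same body as Broadcast.broadcast() above
        }
    If the fat jar is built with the Shadow plugin, its mergeServiceFiles() option addresses the same loss of service registrations at build time.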
  • Yogeshvu

    07/18/2024, 11:46 PM
    Is there support for querying Apache Iceberg tables?