PySpark 2: KMeans The Input Data Is Not Directly Cached

I don't know why I receive the message "WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached." When I try to use Spar

Solution 1:

This message is generated by o.a.s.mllib.clustering.KMeans, and there is nothing you can really do about it without patching the Spark code.

Internally o.a.s.ml.clustering.KMeans:

  • Converts DataFrame to RDD[o.a.s.mllib.linalg.Vector].
  • Executes o.a.s.mllib.clustering.KMeans.

Although you cache the DataFrame, the RDD that is used internally is not cached, which is why you see the warning. It is annoying, but I wouldn't worry too much about it.


Solution 2:

This was fixed in Spark 2.2.0; see SPARK-18356.

The discussion there also suggests this is not a big deal, but the fix may reduce runtime slightly as well as avoid the warning.
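To make the two answers concrete, here is a minimal sketch, not a definitive recipe. The helper names `warning_expected` and `fit_kmeans_on_cached_df` and the toy data are hypothetical and exist only for illustration. The version cutoff is taken from the "fixed in Spark 2.2.0" statement above. The second function caches a DataFrame and fits `pyspark.ml.clustering.KMeans` on it, which is the situation where the warning can still appear on older versions.

```python
def warning_expected(spark_version):
    """Return True for Spark versions older than 2.2.0.

    Assumption: the 2.2.0 cutoff comes from the SPARK-18356 fix mentioned above.
    """
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) < (2, 2)


def fit_kmeans_on_cached_df(spark, k=2):
    """Fit KMeans on a small cached DataFrame (hypothetical toy data)."""
    # Imported lazily so warning_expected() can be used without PySpark installed.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.linalg import Vectors

    rows = [
        (Vectors.dense([0.0, 0.0]),),
        (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),),
        (Vectors.dense([8.0, 9.0]),),
    ]
    # Caching the DataFrame does not cache the RDD that KMeans builds internally.
    df = spark.createDataFrame(rows, ["features"]).cache()
    df.count()  # run an action so the cache is actually populated

    if warning_expected(spark.version):
        print("The 'not directly cached' warning may appear; it is harmless.")
    return KMeans(k=k, seed=1).fit(df)
```

In a PySpark shell, where `spark` is the existing SparkSession, calling `fit_kmeans_on_cached_df(spark)` returns the fitted model, and `clusterCenters()` on that model lists the centers.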

