
Flatmap Over List Of Custom Objects In Pyspark

I'm getting an error when running flatMap() on a list of objects of a class. It works fine for regular Python data types like int and list, but I'm facing an error when the list contains objects of my own class.

Solution 1:

The error you get is completely unrelated to flatMap. If you define the node class in your main script, it is accessible on the driver, but it is not distributed to the workers. To make it work, place the node definition in a separate module and make sure that module is distributed to the workers.

  1. Create a separate module with the node definition; let's call it node.py
  2. Import the node class in your main script:

    from node import node
    
  3. Make sure the module is distributed to the workers:

    sc.addPyFile("node.py")
    

Now everything should work as expected.
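Putting the steps above together, a minimal sketch might look like the following. The `value` attribute and the lambda passed to flatMap are illustrative assumptions, not part of the original question; the local list comprehension at the end only simulates flatMap's flatten-and-concatenate semantics so the example runs without a Spark cluster.

```python
# node.py -- the class lives in its own module so Spark can ship it to workers
class node(object):
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return "node({0})".format(repr(self.value))


# In the driver script you would then do (assuming a running SparkContext `sc`):
#
#   from node import node
#   sc.addPyFile("node.py")                # distribute the module to the workers
#   rdd = sc.parallelize([node(1), node(2)])
#   flat = rdd.flatMap(lambda n: [n.value, n.value * 10]).collect()
#
# flatMap applies the function to every element and concatenates the resulting
# lists, so `flat` would be [1, 10, 2, 20].

# The same flatMap semantics, reproduced locally for illustration:
nodes = [node(1), node(2)]
flat = [x for n in nodes for x in (n.value, n.value * 10)]
print(flat)  # [1, 10, 2, 20]
```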

On a side note:

  • PEP 8 recommends CapWords for class names. It is not a hard requirement, but it makes life easier
  • The __repr__ method should return a string representation of the object. At the very least make sure it returns a string, but a proper representation is even better:

    def __repr__(self):
        return "node({0})".format(repr(self.value))
    
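Combining both side notes, here is a hedged sketch: the `Node` name and `value` attribute are illustrative. A `__repr__` written this way can even round-trip through `eval()`, which is one common measure of a "proper" representation.

```python
class Node(object):  # CapWords class name, per PEP 8
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        # repr(self.value) quotes strings correctly, so the result
        # is valid Python that reconstructs an equivalent object
        return "Node({0})".format(repr(self.value))


print(repr(Node("a")))              # Node('a')
print(eval(repr(Node(42))).value)   # 42
```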
