Joining two list columns in Polars
Today I was working on my Polars code and I needed to join two two columns.
I should say this is my first time working with Polars and doing data analysis/manipulation/science at all, but I figured you guys might run into the same thing that I did.
I will try to post anything that I find interesting or that takes me some time to figure out.
I needed to join two columns of which one of them might not have a value into a list of strings column.
There is probably (definitely, there’s gotta be…) a better way of doing this, and I am also considering just learning some Rust to understand more how all of this is being process behind the scenes, maybe even write some plugins as well, as there have been some instances where using the eval function leaves me disatisfied.
Anyways, I was trying to concat two lists where one of the columns, or even both, could have null values. This resulted in a list that had one or more null values as one of the entries.
Here is the code snippet:
./src/python/main.py
df = df.with_columns(
items_list=pl.when(
pl.col("item_1").is_not_null(),
).then(
pl.concat_list(
pl.col("item_1"),
pl.when(
pl.col("item_2_from_join").is_not_null(),
).then(
pl.col("item_2_from_join")
).otherwise(
pl.lit(None, dtype=pl.String)
)
)
)
)
Apparently, doing this conditional checking does not work as I thought it would. I was expecting something like this to happen
./src/python/main.py
ret = []
value_1 = "some"
value_2 = "value" or None
if value_1:
if not value_2:
ret.append(value_1)
else:
ret.append(value_2)
I don’t think my Polars code reflected that at all. And that is one of the main pitfalls I get into when writting Polars or Pandas. I tend to get tunnel vision and forget how to express my thoughts. I have tried to make it a habit to first write the logic in my language flavor of the day, in this case Python, and then trying to translate it to Polars or any other library. It is almost like using an ORM, which in itself is one more thing that I makes me forget that in reality, we are just working with SQL.
I ended up coming up with this code:
./src/python/main.py
df = df.with_columns(
items_list=pl.when(
pl.col("item_1").is_not_null() & pl.col("item_2_from_join").is_not_null()
).then(
pl.concat_list(pl.col("item_1"), pl.col("item_2_from_join"))
).otherwise(
pl.col("item_1").cast(pl.List(pl.String()))
)
)
Also, while writing this I notice that would be equivalent to this:
./src/python/main.py
ret = []
value_1 = "some"
value_2 = "value" or None
if value_1 and value_2:
ret.append(value_1)
ret.append(value_2)
else:
ret.append(value_1)
As you can see. I am assuming that value_1 is always valid.
After writting this note I went back and decided to add another conditional check to get either value_1 or value_2, whichever is valid at that time.
Basically. Not only my Polars logic was wrong, but so was my Python logic :D.
You live and you learn, and the point of this is to point out my mistakes so I can come back in the future and remember how stupid I was.