init dataframe with all columns as string data type #22302
Comments
what about
I'm guessing this would be the best, assuming the first doesn't work:

    cols = set()
    for a_dict in list_of_dicts:
        for a_key in a_dict.keys():
            cols.add(a_key)
    pl.DataFrame(list_of_dicts, schema_overrides={x: pl.String for x in cols})

in which case you can make

    def string_df(list_of_dicts):
        cols = set()
        for a_dict in list_of_dicts:
            for a_key in a_dict.keys():
                cols.add(a_key)
        return pl.DataFrame(list_of_dicts, schema_overrides={x: pl.String for x in cols})

and then all you have to do is
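A minimal sketch of the same helper that keeps columns in first-seen order, since a `set` does not preserve order. `collect_columns` is a hypothetical name introduced here; the `pl.DataFrame(..., schema_overrides=...)` call is the one from the comment above. Another comment in this thread reports that `schema_overrides` can still lose a column, so treat this as a starting point, not a fix.

```python
from typing import Any, Dict, Iterable, List


def collect_columns(rows: Iterable[Dict[str, Any]]) -> List[str]:
    """Return every key seen in any row, in first-seen order."""
    seen: Dict[str, None] = {}
    for row in rows:
        # Updating with existing keys keeps their original position.
        seen.update(dict.fromkeys(row))
    return list(seen)


def string_df(rows: List[Dict[str, Any]]):
    """Build a DataFrame, overriding every known column to pl.String."""
    import polars as pl  # imported lazily; collect_columns itself needs no polars

    cols = collect_columns(rows)
    return pl.DataFrame(rows, schema_overrides={c: pl.String for c in cols})
```

Usage would then be `df = string_df(list_of_dicts)`.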
Data from lists of dicts seems to be problematic in general. Even with
New issue from today: #22297
Some older similar issues were also commented on earlier today: #15327 #18880
I've had better results by using a fast dumper (e.g.

    import io
    import orjson
    ...
    f = io.BytesIO()
    for row in rows:
        f.write(orjson.dumps(row))
        f.write(b"\n")
    df = pl.scan_ndjson(f).collect()
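One way to combine the NDJSON route with the "everything as string" request might be to stringify each value before dumping, so the reader only ever sees strings or nulls. This is a sketch under assumptions: it swaps in the standard-library `json` module for `orjson`, the helper names `stringify`, `rows_to_ndjson` and `read_as_strings` are hypothetical, and it has not been measured on GBs of data or 80k columns.

```python
import io
import json
from typing import Any, Dict, Iterable, Optional


def stringify(value: Any) -> Optional[str]:
    """Keep nulls as nulls; turn everything else into its str() form."""
    return None if value is None else str(value)


def rows_to_ndjson(rows: Iterable[Dict[str, Any]]) -> bytes:
    """Serialize rows as newline-delimited JSON with all values as strings."""
    buf = io.BytesIO()
    for row in rows:
        line = json.dumps({k: stringify(v) for k, v in row.items()})
        buf.write(line.encode("utf-8"))
        buf.write(b"\n")
    return buf.getvalue()


def read_as_strings(rows: Iterable[Dict[str, Any]]):
    """Parse the stringified NDJSON with polars (assumes polars is installed)."""
    import polars as pl  # imported lazily; the helpers above are pure stdlib

    return pl.read_ndjson(io.BytesIO(rows_to_ndjson(rows)))
```

Keys that first appear late in the data may still depend on the reader's `infer_schema_length` setting, so that is worth checking on real data.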
my dicts are not nested
I almost suggested
schema_overrides even loses a column, and schema is extremely slow/memory hungry for GBs of data.
shape: (2, 2)
The diagonal_relaxed approach is also extremely slow/memory hungry for GBs of data. I have around 80k columns.
Yes, I believe some of the existing issues are about that particular example. Even with

    pl.DataFrame({"x": [{"a": 1}, {"a": 1, "b": 2}]}, infer_schema_length=None)
Description
I have a list of dicts.
When I do df = pl.DataFrame([the_list_of_dicts]), it can fail with schema inference errors.
If I set infer_schema_length=None, it is very slow.
I would love an option to say 'ALL columns should become string data type'.
The workaround I have now is to pass schema_overrides, but that requires listing all the column names.
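Until such an option exists, one possible workaround is to pivot the rows into string columns in plain Python and hand polars a column-oriented dict with an explicit schema, which sidesteps row-wise inference. This is only a sketch: `to_string_columns` and `string_frame` are hypothetical names, and the dense per-column lists may be impractical for very wide or very large data.

```python
from typing import Any, Dict, List, Optional


def to_string_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Optional[str]]]:
    """Pivot row dicts into columns of str values; missing keys become None."""
    n = len(rows)
    columns: Dict[str, List[Optional[str]]] = {}
    for i, row in enumerate(rows):
        for key, value in row.items():
            if key not in columns:
                # Pre-fill with None so rows lacking this key stay null.
                columns[key] = [None] * n
            if value is not None:
                columns[key][i] = str(value)
    return columns


def string_frame(rows: List[Dict[str, Any]]):
    """Build an all-String DataFrame (assumes polars is installed)."""
    import polars as pl  # imported lazily; to_string_columns is pure Python

    columns = to_string_columns(rows)
    return pl.DataFrame(columns, schema={name: pl.String for name in columns})
```

Because the column names come from the data itself, nothing has to be listed by hand.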