init dataframe with all columns as string data type #22302


Open
tooptoop4 opened this issue Apr 16, 2025 · 6 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@tooptoop4

Description

I have a list of dicts.
When I do df = pl.DataFrame(the_list_of_dicts)

it can error with schema-inference errors.

If I pass infer_schema_length=None it's very slow.

I would love an option to say 'ALL columns should become the String data type'.

The workaround I have now is to pass schema_overrides, but that requires listing all the column names.

@tooptoop4 tooptoop4 added the enhancement New feature or an improvement of an existing feature label Apr 16, 2025
@deanm0000
Collaborator

deanm0000 commented Apr 16, 2025

What about pl.DataFrame(list_of_dicts, schema_overrides={x: pl.String for x in list_of_dicts[0].keys()})? Or, if the list_of_dicts doesn't always have the same keys, you could try pl.concat([pl.DataFrame([a_dict]) for a_dict in list_of_dicts], how="diagonal_relaxed"). I'm not trying to discount your feature request, just trying to give workarounds until (if) it's implemented. I'm not confident that second one would be very fast, but maybe ¯\_(ツ)_/¯

I'm guessing this would be the best, assuming the first doesn't work:

cols = set()
for a_dict in list_of_dicts:
    for a_key in a_dict.keys():
        cols.add(a_key)
pl.DataFrame(list_of_dicts, schema_overrides={x: pl.String for x in cols})

in which case you can make:

def string_df(list_of_dicts):
    cols = set()
    for a_dict in list_of_dicts:
        for a_key in a_dict.keys():
            cols.add(a_key)
    return pl.DataFrame(list_of_dicts, schema_overrides={x: pl.String for x in cols})

and then all you have to do is df = string_df(list_of_dicts)

@cmdlineluser
Contributor

Data from lists of dicts seems to be problematic in general.

Even with infer_schema_length=None data can be dropped.

New issue from today: #22297

Some older similar issues were also commented on earlier today: #15327 #18880

I've had better results by using a fast dumper (e.g. orjson) and using the Polars ndjson readers.

import io
import orjson

...

f = io.BytesIO()
for row in rows:
    f.write(orjson.dumps(row))
    f.write(b"\n")
f.seek(0)  # rewind the buffer before scanning

df = pl.scan_ndjson(f).collect()

@tooptoop4
Author

My dicts are not nested.

@deanm0000
Collaborator

I almost suggested pl.read_json(orjson.dumps(list_of_dicts)) but figured it'd have the same issue with mixed data types. Is there a benefit to making it ndjson?

@tooptoop4
Author

tooptoop4 commented Apr 16, 2025

schema_overrides even loses columns, and passing a full schema is extremely slow/memory-hungry for GBs of data.

ld = [{"a": 1, "b": 2}, {"a": 3, "b": 4, "c": 5}]
df = pl.DataFrame(ld, infer_schema_length=1, schema_overrides={"c": pl.String, "b": pl.String, "a": pl.String})
print(df)

shape: (2, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ 1 ┆ 2 │
│ 3 ┆ 4 │
└─────┴─────┘
^ note the missing c column

The diagonal_relaxed approach is also extremely slow/memory-hungry for GBs of data. I have around 80k columns.

@cmdlineluser
Contributor

Yes, I believe some of the existing issues are about that particular example.

Even with infer_schema_length=None it only seems to use the first set of keys.

pl.DataFrame({"x": [{"a": 1}, {"a": 1, "b": 2}]}, infer_schema_length=None)
shape: (2, 1)
┌───────────┐
│ x         │
│ ---       │
│ struct[1] │
╞═══════════╡
│ {1}       │
│ {1}       │
└───────────┘
