init dataframe with all columns as string data type #22302


Open
tooptoop4 opened this issue Apr 16, 2025 · 6 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@tooptoop4

Description

I have a list of dicts.
When I do df = pl.DataFrame(the_list_of_dicts)

it can error with schema-inference errors.

If I pass infer_schema_length=None it's very slow.

I would love an option to say 'ALL columns should become the String data type'.

The workaround I have now is to pass schema_overrides, but that requires listing all the column names.

@tooptoop4 tooptoop4 added the enhancement New feature or an improvement of an existing feature label Apr 16, 2025
@deanm0000
Collaborator

deanm0000 commented Apr 16, 2025

What about pl.DataFrame(list_of_dicts, schema_overrides={x: pl.String for x in list_of_dicts[0].keys()})? Or, if the list_of_dicts doesn't always have the same keys, you could try pl.concat([pl.DataFrame([a_dict]) for a_dict in list_of_dicts], how="diagonal_relaxed"). I'm not trying to discount your feature request, just trying to give workarounds until (if) it's implemented. I'm not confident that second one would be very fast, but maybe ¯\_(ツ)_/¯

I'm guessing this would be the best, assuming the first doesn't work:

cols = set()
for a_dict in list_of_dicts:
    for a_key in a_dict.keys():
        cols.add(a_key)
pl.DataFrame(list_of_dicts, schema_overrides={x: pl.String for x in cols})

in which case you can make:

def string_df(list_of_dicts):
    cols = set()
    for a_dict in list_of_dicts:
        for a_key in a_dict.keys():
            cols.add(a_key)
    return pl.DataFrame(list_of_dicts, schema_overrides={x: pl.String for x in cols})

and then all you have to do is df = string_df(list_of_dicts)

@cmdlineluser
Contributor

Data from lists of dicts seems to be problematic in general.

Even with infer_schema_length=None data can be dropped.

New issue from today: #22297

Some older similar issues were also commented on earlier today: #15327 #18880

I've had better results by using a fast dumper (e.g. orjson) and using the Polars ndjson readers.

import io
import orjson

...

f = io.BytesIO()
for row in rows:
    f.write(orjson.dumps(row))
    f.write(b"\n")
f.seek(0)  # rewind the buffer before scanning

df = pl.scan_ndjson(f).collect()

@tooptoop4
Author

My dicts are not nested.

@deanm0000
Collaborator

I almost suggested pl.read_json(orjson.dumps(list_of_dicts)) but figured it'd have the same issue with mixed data types. Is there a benefit to making it ndjson?

@tooptoop4
Author

tooptoop4 commented Apr 16, 2025

schema_overrides even loses columns, and passing a full schema is extremely slow/memory-hungry for GBs of data.

ld = [{"a": 1, "b": 2}, {"a": 3, "b": 4, "c": 5}]
df = pl.DataFrame(ld, infer_schema_length=1, schema_overrides={"c": pl.String, "b": pl.String, "a": pl.String})
print(df)

shape: (2, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ 1 ┆ 2 │
│ 3 ┆ 4 │
└─────┴─────┘
^ note the missing c column

The diagonal_relaxed approach is also extremely slow/memory-hungry for GBs of data. I have around 80k columns.

@cmdlineluser
Contributor

Yes, I believe some of the existing issues are about that particular example.

Even with infer_schema_length=None it only seems to use the first set of keys.

pl.DataFrame({"x": [{"a": 1}, {"a": 1, "b": 2}]}, infer_schema_length=None)
shape: (2, 1)
┌───────────┐
│ x         │
│ ---       │
│ struct[1] │
╞═══════════╡
│ {1}       │
│ {1}       │
└───────────┘
