Very nice that it can show the metadata. If you'd rather focus on the data itself, VisiData [1] is a Swiss army knife for the terminal. It works with many formats, from CSV to Parquet. I think you need to install PyArrow to read Parquet files. VisiData is great not only for peeking into a file but also for filtering, sorting, computing simple metrics, and even plotting a histogram or scatterplot. I've avoided a lot of Jupyter notebooks by using VisiData :)
Nice work—this hits a real pain point with Parquet.
My main use case is debugging partitioned datasets on S3 with schema drift and skew, where I care about: which files/partitions have schema mismatches, weird row-group stats (all-null, out-of-range, huge skew), and doing that via metadata only.
Right now parqeye looks mainly single-file focused. Do you have plans for a “dataset mode” that takes a dir/S3 prefix and surfaces per-file/row-group summaries (row counts, min/max, null %, schema diffs vs a reference file) using just Parquet stats so it scales to tens of GB? Or do you see parqeye intentionally staying a single-file inspector?
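To make the "dataset mode" idea concrete, here is a rough sketch in Python. It is not how parqeye works, just an illustration of a metadata-only pass. It assumes PyArrow is installed and that the dataset is a local directory; an S3 prefix would need a filesystem layer that is not shown. The function names (`footer_summary`, `schema_diff`, `all_null_columns`, `scan`) are made up for this example.

```python
from pathlib import Path


def footer_summary(path):
    """Summarise one file from its Parquet footer only (no data pages are read)."""
    import pyarrow.parquet as pq  # third-party, imported lazily (assumed installed)

    pf = pq.ParquetFile(path)
    meta = pf.metadata
    arrow_schema = pf.schema_arrow
    schema = {n: str(t) for n, t in zip(arrow_schema.names, arrow_schema.types)}
    row_groups = []
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        columns = {}
        for j in range(rg.num_columns):
            col = rg.column(j)
            st = col.statistics  # None when the writer stored no statistics
            if st is None:
                continue
            columns[col.path_in_schema] = {
                "min": st.min if st.has_min_max else None,
                "max": st.max if st.has_min_max else None,
                "nulls": st.null_count,
            }
        row_groups.append({"rows": rg.num_rows, "columns": columns})
    return {"schema": schema, "row_groups": row_groups}


def schema_diff(reference, other):
    """Return {column: (reference_type, other_type)} for every mismatch."""
    diff = {}
    for name in set(reference) | set(other):
        ref_type, other_type = reference.get(name), other.get(name)
        if ref_type != other_type:
            diff[name] = (ref_type, other_type)
    return diff


def all_null_columns(row_group):
    """Columns whose recorded null count equals the row group's row count."""
    return sorted(
        name
        for name, st in row_group["columns"].items()
        if st["nulls"] is not None and st["nulls"] == row_group["rows"]
    )


def scan(root):
    """Compare every *.parquet file under root against the first one found."""
    files = sorted(Path(root).rglob("*.parquet"))
    if not files:
        return
    reference = footer_summary(files[0])["schema"]
    for f in files:
        summary = footer_summary(f)
        drift = schema_diff(reference, summary["schema"])
        if drift:
            print(f"{f}: schema drift {drift}")
        for i, rg in enumerate(summary["row_groups"]):
            empty = all_null_columns(rg)
            if empty:
                print(f"{f} row group {i}: all-null columns {empty}")
```

Calling `scan("path/to/dataset")` prints one line per file with schema drift and one per row group with all-null columns. Since only footers are read, the cost should grow with the number of files rather than the data size, though I have not measured that.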
Great! I worked a lot with parquet like 5 years ago. The frustration and tilt working with the tooling was immense. Thank you for building this, it feels like resolving some old knot in my soul.
Some kind soul made this repository then, and I found it on like the 13th page of Google while in the depths of despair. It is my most treasured GitHub star, the shining beacon that saved me. I see it has saved 17 other people too.
Similar tool for JSONL files: I built JSONL Viewer Pro after repeatedly crashing VS Code trying to inspect multi-GB training datasets and IoT device logs with nested objects.
It's a native Mac/Windows app with multi-threaded parsing (simdjson) and automatic nested object flattening, and it handles 10M+ rows instantly.
Super quick feedback: opening that link on my phone shows me two options next to each other, seemingly with the same name and description (followed by …) and the same price tag. I had to turn my phone sideways to see that there is a Windows and a Mac version.
I think you can afford the extra characters to show the whole page in portrait mode. (iPhone 16 Pro, Safari)
This looks very handy, thank you for working on this and making it open source.
I did submit a feature request for vi keybindings; though I could look into contributing this myself if I find a bit of spare time.
The other thing that surprised me was the size of the binaries: 90MB for a TUI tool (x64 Linux)? I wonder what the bulk of that is? Is there an issue with LTO? Another commenter noticed as well.
It also looks like you are building against a relatively recent glibc (2.34), which limits compatibility with older systems. Building against an older glibc can be hard to do, so I am not faulting you here, and you do provide a musl fallback, which is appreciated (mandatory notice that the musl allocator can dramatically degrade the performance of Rust programs, just in case you were not aware of this).
A few more ideas for improvements (you probably already have your own laundry list):
- Mouse support?
- Seeing that you do have graphs, it would be fun to see a scatter plot as well as a distribution plot under statistics in the "Row Groups" tab (though you probably pull these from the metadata, so that would require further processing, which may be out of scope).
It's unfortunate that Python and R don't really have any out-of-the-box means of opening data files from arguments, but if you do this kind of stuff on a daily basis, it's something you can set up. My examples below are not directly usable as-is.
Beautiful. I'm currently deep into getting our data into Iceberg from Firehose, and I'm really curious what metadata is written: are Bloom filters being written for the columns I want? Have my compaction and sort jobs helped the min-max statistics on those columns?
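One way to put a number on the min-max question is to count how many row groups have overlapping [min, max] ranges for a column: fewer overlaps means a range filter can skip more row groups. Below is a hedged Python sketch of that check. It assumes PyArrow for reading the footer, the names `column_ranges` and `overlapping_pairs` are invented here, and it does not cover the Bloom filter question.

```python
def column_ranges(path, column):
    """Collect one (min, max) per row group for a column, from the footer only."""
    import pyarrow.parquet as pq  # third-party, imported lazily (assumed installed)

    meta = pq.read_metadata(path)
    ranges = []
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        for j in range(rg.num_columns):
            col = rg.column(j)
            if col.path_in_schema != column:
                continue
            st = col.statistics
            if st is not None and st.has_min_max:
                ranges.append((st.min, st.max))
            else:
                ranges.append((None, None))  # no usable statistics recorded
    return ranges


def overlapping_pairs(ranges):
    """Count pairs of row groups whose [min, max] ranges overlap.

    Row groups without statistics are ignored. 0 means the ranges are
    fully disjoint, which is the best case for min-max pruning.
    """
    known = sorted(r for r in ranges if r[0] is not None and r[1] is not None)
    count = 0
    for i, (_, high) in enumerate(known):
        for low, _ in known[i + 1:]:
            if low > high:
                break  # sorted by min, so every later range starts even later
            count += 1
    return count
```

Comparing `overlapping_pairs(column_ranges(path, "some_column"))` on a file written before and after the sort job gives a rough before/after signal; the column name is a placeholder.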
It is really incredible how poor the parquet tooling has been for years. The cornerstone of data engineering, yet just inspecting a file is needlessly clunky.
Can DuckDB be included in the tool, so you can run queries directly from the UI? [that would avoid opening DBeaver whenever you need that kind of feature]
[1] https://www.visidata.org/
[1] https://github.com/Vitruves/nail-parquet [2] https://github.com/NixOS/nixpkgs/pull/449066
https://github.com/llimllib/personal_code/blob/c1a74b1b9527f...
Another seemingly extremely similar project released in the last few days: https://github.com/raulcd/datanomy
https://github.com/casidiablo/parquet-tools-for-dumb-people-...
For HN: Use code HN100 for free access
https://iotdatasystems.gumroad.com/
Built with C++ for native performance (~6MB app, not Electron).
Would love feedback from folks working with large JSONL files.
https://imgur.com/a/aTxO3sp
Also just added a Data Plot feature for visualizing numeric columns.
Thanks to everyone who reported the issue!
Python (uv + dataiter, but easy to modify for pandas or polars): https://github.com/otsaloma/dataiter/blob/master/bin/di-open
R (as per comment, requires also ~/.Rprofile code, nanoparquet in this case): https://github.com/otsaloma/R-tools/blob/master/r-load
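For anyone who wants the general shape without following the links, here is a minimal Python sketch of the same idea using pandas. It is not the linked `di-open` script. The helper names and the extension table are my own assumptions, and reading Parquet through pandas additionally needs PyArrow or fastparquet.

```python
import code
from pathlib import Path

# File extension -> name of the pandas reader function to call.
READERS = {
    ".csv": "read_csv",
    ".json": "read_json",
    ".jsonl": "read_json",
    ".parquet": "read_parquet",
}


def pick_reader(path):
    """Return (pandas function name, extra keyword arguments) for a path."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"no reader configured for {suffix!r}")
    # JSON Lines files hold one JSON object per line.
    kwargs = {"lines": True} if suffix == ".jsonl" else {}
    return READERS[suffix], kwargs


def open_data(path):
    """Load the file into a DataFrame using the reader chosen by extension."""
    import pandas as pd  # third-party, imported lazily (assumed installed)

    name, kwargs = pick_reader(path)
    return getattr(pd, name)(path, **kwargs)


def main(argv):
    """Load argv[1], print a preview, then start a REPL with `data` bound."""
    data = open_data(argv[1])
    print(data.head())
    code.interact(local={"data": data})
```

Saved as an executable script that ends with `main(sys.argv)` (after `import sys`), this lets you run something like `pd-open file.parquet` and land in a prompt with `data` already loaded; `pd-open` is just a placeholder name.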
Will take a look when I get to my laptop!
Also allows you to do computations on the data in place.
BTW, you can use DuckDB with their UI plugin to get an interactive view of your data, not only Parquet.
Note: must the Windows binary really be 78MB?
I tried to install with brew, but it told me my CLI tools were "too out of date". Never seen that before! And I had just upgraded, too.
Will try again tomorrow.