Files, Streams, and Data Processing Guidelines
Basic principles
- Prefer streaming and iterators over loading entire large files into memory.
- Be explicit about encodings; default to UTF-8 when reasonable.
- Use context managers for all resources (`with open(...) as f:`).
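
As a minimal sketch of these three principles together, the helpers below stream a text file line by line with an explicit encoding inside a context manager. The names `iter_nonblank_lines` and `count_matching` are hypothetical and exist only for illustration.

```python
from pathlib import Path
from typing import Iterator


def iter_nonblank_lines(path: Path, encoding: str = "utf-8") -> Iterator[str]:
    """Yield stripped, non-empty lines one at a time."""
    # The with block closes the file once iteration finishes
    # or the generator is closed.
    with open(path, encoding=encoding) as f:
        for line in f:  # a file object is itself a lazy line iterator
            line = line.strip()
            if line:
                yield line


def count_matching(path: Path, needle: str) -> int:
    """Count lines containing `needle` without building a list of lines."""
    return sum(1 for line in iter_nonblank_lines(path) if needle in line)
```

Because both functions work on an iterator, memory use stays roughly proportional to the longest line rather than to the file size.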
Large files and performance
- For large text/binary files:
  - process in chunks or line-by-line
  - consider `mmap` for specific use cases where it simplifies access patterns.
- Avoid unnecessary copies of large data structures.
- For data processing, consider columnar formats (e.g. Parquet) when appropriate.
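
The chunked and `mmap` approaches might look like the sketch below. The function names and the 1 MiB chunk size are illustrative assumptions, not recommendations from this guide.

```python
import hashlib
import mmap
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # 1 MiB; an arbitrary illustrative value


def sha256_of_file(path: Path, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a file of any size while holding only one chunk in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # read() returns b"" at end of file, which ends the loop
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def find_offset(path: Path, needle: bytes) -> int:
    """Return the byte offset of the first `needle`, or -1 if absent."""
    if path.stat().st_size == 0:
        return -1  # an empty file cannot be memory-mapped
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle)
```

Chunked reads suit sequential passes such as hashing or copying. The `mmap` version is more natural when the code needs searching or random access over the file as if it were one bytes-like object.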
Safety and atomicity
- For writes that must not corrupt data:
  - write to a temporary file in the same directory as the target
  - flush and fsync it if the data must survive a crash
  - then atomically rename it over the target.
- Validate paths and avoid directory traversal vulnerabilities when working with user-supplied paths.
- Handle missing directories gracefully (create them when sensible, or fail with a clear error).
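
The write-then-rename sequence and a basic traversal check can be sketched as follows. `atomic_write_text` and `resolve_under` are hypothetical helper names, and the code assumes the temporary file can be created next to the target so the rename stays on one filesystem.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(target: Path, text: str, encoding: str = "utf-8") -> None:
    """Replace `target` with `text` so readers see the old or new file, never a partial one."""
    target.parent.mkdir(parents=True, exist_ok=True)  # create missing directories
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding=encoding) as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_name, target)  # atomic when both paths share a filesystem
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def resolve_under(base_dir: Path, user_path: str) -> Path:
    """Resolve a user-supplied path and reject it if it escapes `base_dir`."""
    base = base_dir.resolve()
    candidate = (base / user_path).resolve()  # collapses ".." and follows symlinks
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

Two caveats apply to this sketch. `tempfile.mkstemp` creates the file readable and writable only by its owner, so the replaced target inherits those permissions unless the caller adjusts them. The sketch also does not fsync the containing directory, which stricter durability requirements may call for on some systems.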
Formats and parsing
- Prefer standard libraries (`json`, `csv`, `pathlib`) where possible.
- When using third-party libraries (e.g. `pyyaml`), use safe loading functions.
- Clearly define schemas (via `pydantic` or dataclasses) when reading structured data.
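
One way to pair a standard-library parser with an explicit schema is sketched below, using only `json` and a dataclass. The `User` record and its fields are invented for illustration. For YAML input, PyYAML's `yaml.safe_load` would take the place of `json.loads`.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """Hypothetical record type; the fields are illustrative only."""
    name: str
    age: int


def parse_user(raw: str) -> User:
    """Parse a JSON object into a User, rejecting unexpected shapes."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    unknown = set(data) - {"name", "age"}
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    name, age = data.get("name"), data.get("age")
    if not isinstance(name, str):
        raise ValueError("'name' must be a string")
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(age, bool) or not isinstance(age, int):
        raise ValueError("'age' must be an integer")
    return User(name=name, age=age)
```

Hand-written checks like these stay manageable for small records. For larger or nested schemas, a validation library such as `pydantic` removes most of this boilerplate.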
Cross-platform behavior
- Use `pathlib` instead of manual string path manipulation.
- Be mindful of line endings, file permissions, and case sensitivity across OSes.
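
The short sketch below illustrates both points: building paths with `pathlib`, pinning line endings on write, and using pure path classes to see how case sensitivity differs between path flavors. The helper names are hypothetical.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath
from typing import Iterable


def build_path(base: Path, *parts: str) -> Path:
    """Join path components without hard-coding a separator."""
    return base.joinpath(*parts)


def write_lf_lines(path: Path, lines: Iterable[str]) -> None:
    """Write lines with "\\n" endings on every platform."""
    # newline="\n" turns off translation of "\n" to the OS default.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")


def same_path_on(flavor: type, a: str, b: str) -> bool:
    """Compare two path strings under a given pure-path flavor's rules."""
    return flavor(a) == flavor(b)


# Pure paths apply another OS's comparison rules without touching the disk.
posix_equal = same_path_on(PurePosixPath, "Data/File.TXT", "data/file.txt")
windows_equal = same_path_on(PureWindowsPath, "Data/File.TXT", "data/file.txt")
```

Here `posix_equal` is `False` and `windows_equal` is `True`, because Windows path comparison ignores case. Two names that differ only by case can therefore coexist on one system and collide on another.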