# Files, Streams, and Data Processing Guidelines
## Basic principles
- Prefer streaming and iterators over loading entire large files into memory.
- Be explicit about encodings; default to UTF-8 when reasonable.
- Use context managers for all resources (`with open(...) as f:`).
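
The three principles above can be combined in a few lines. The following is a minimal sketch, not a prescribed implementation; the function names `iter_nonblank_lines` and `count_matching` are hypothetical and chosen only for illustration.

```python
from pathlib import Path
from typing import Iterator


def iter_nonblank_lines(path: Path, encoding: str = "utf-8") -> Iterator[str]:
    """Yield stripped, non-empty lines one at a time (hypothetical helper)."""
    # The encoding is passed explicitly instead of relying on the platform default.
    # The context manager closes the file when the generator finishes or is closed.
    with open(path, encoding=encoding) as f:
        for line in f:  # file objects iterate lazily, one line at a time
            line = line.strip()
            if line:
                yield line


def count_matching(path: Path, needle: str) -> int:
    """Count lines containing `needle` without building a list of all lines."""
    return sum(1 for line in iter_nonblank_lines(path) if needle in line)
```

Because the generator yields one line at a time, memory use depends on the longest line rather than on the size of the file.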
## Large files and performance
- For large text/binary files:
  - process in chunks or line-by-line
  - consider `mmap` for specific use cases where it simplifies access patterns.
- Avoid unnecessary copies of large data structures.
- For data processing, consider columnar formats (e.g. Parquet) when appropriate.
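
As an illustration of the chunked and `mmap` approaches, here is a sketch using only the standard library. The 1 MiB chunk size and the helper names `sha256_of_file` and `find_offset` are assumptions made for this example, not recommendations from the guidelines.

```python
import hashlib
import mmap
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # 1 MiB; an arbitrary choice for this sketch


def sha256_of_file(path: Path, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a file of any size while holding at most one chunk in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_offset(path: Path, pattern: bytes) -> int:
    """Return the byte offset of the first `pattern` match, or -1 if absent."""
    if path.stat().st_size == 0:
        return -1  # mmap cannot map an empty file
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(pattern)
```

Chunked reads suit sequential passes such as hashing or copying, while `mmap` tends to pay off when the access pattern is random or when a search would otherwise have to handle matches that straddle chunk boundaries.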
## Safety and atomicity
- For writes that must not corrupt data:
  - write to a temporary file in the same directory as the destination
  - flush and `fsync` it if the data must survive a crash or power loss
  - then atomically rename it over the destination.
- Validate paths and avoid directory traversal vulnerabilities when working with user-supplied paths.
- Handle missing directories gracefully (create them when sensible, or fail with a clear error).
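
The write-then-rename sequence and the path check above might look like the following sketch. The names `atomic_write_text` and `resolve_under` are hypothetical, and the code assumes the temporary file and the destination sit on the same filesystem, which is why the temporary file is created next to the destination.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(dest: Path, text: str, encoding: str = "utf-8") -> None:
    """Write `text` so readers see the old or the new content, never a mix."""
    dest.parent.mkdir(parents=True, exist_ok=True)  # create missing directories
    # Creating the temp file in the destination directory keeps the final
    # rename on a single filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, prefix=dest.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding) as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to disk
        os.replace(tmp_name, dest)  # overwrites dest; atomic on POSIX
    except BaseException:
        # Best-effort cleanup so a failed write leaves no stray temp file.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def resolve_under(base: Path, user_path: str) -> Path:
    """Resolve `user_path` inside `base`, rejecting anything that escapes it."""
    base = base.resolve()
    candidate = (base / user_path).resolve()
    try:
        # relative_to raises ValueError when candidate is not inside base.
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"path escapes base directory: {user_path!r}") from None
    return candidate
```

Two caveats apply to this sketch: `tempfile.mkstemp` creates the file with owner-only permissions, so the destination ends up with those permissions unless they are adjusted before the rename, and full crash durability on some systems also requires syncing the containing directory, which is omitted here.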
## Formats and parsing
- Prefer standard libraries (`json`, `csv`, `pathlib`) where possible.
- When using third-party libraries (e.g. `pyyaml`), use their safe loading functions (e.g. `yaml.safe_load`).
- Clearly define schemas (via `pydantic` or dataclasses) when reading structured data.
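
To show how a schema and a standard-library parser fit together, here is a small sketch using `csv` and a dataclass. The `Reading` schema, its column names, and `parse_readings` are invented for this example; a real project would define fields to match its own data, and `pydantic` could fill the same role with richer validation.

```python
import csv
from dataclasses import dataclass
from typing import Iterator, TextIO


@dataclass(frozen=True)
class Reading:
    """Hypothetical schema: one row of a sensor CSV export."""
    sensor_id: str
    celsius: float


def parse_readings(source: TextIO) -> Iterator[Reading]:
    """Convert each CSV row to a Reading, failing loudly on malformed rows."""
    reader = csv.DictReader(source)
    for row in reader:
        try:
            yield Reading(sensor_id=row["sensor_id"], celsius=float(row["celsius"]))
        except (KeyError, TypeError, ValueError) as exc:
            # reader.line_num is the number of source lines read so far.
            raise ValueError(f"bad row at line {reader.line_num}: {row!r}") from exc
```

Converting rows to typed objects at the boundary means the rest of the program never handles raw strings, and a malformed row is reported with its line number instead of surfacing later as an unrelated error.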
## Cross-platform behavior
- Use `pathlib` instead of manual string path manipulation.
- Be mindful of line endings, file permissions, and case sensitivity across OSes.
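
A few of these cross-platform concerns can be handled directly in code. The sketch below is illustrative only; `config_path`, `write_unix_lines`, and `case_collisions` are hypothetical helper names, and it requires Python 3.9 or later for the `list[str]` annotations.

```python
from pathlib import Path


def config_path(base: Path, *parts: str) -> Path:
    """Join path segments with pathlib instead of hard-coding a separator."""
    return base.joinpath(*parts)


def write_unix_lines(path: Path, lines: list[str]) -> None:
    """Write lines with LF endings on every OS."""
    # newline="\n" stops Python from translating "\n" to os.linesep,
    # which would be "\r\n" on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")


def case_collisions(names: list[str]) -> set[str]:
    """Return casefolded names that occur more than once.

    Such names are distinct on case-sensitive filesystems but would
    clash on case-insensitive ones.
    """
    seen: set[str] = set()
    clashes: set[str] = set()
    for name in names:
        key = name.casefold()
        if key in seen:
            clashes.add(key)
        seen.add(key)
    return clashes
```

File permissions are left out of this sketch because their semantics differ substantially between POSIX systems and Windows.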