2026-03-21 19:36:11 +03:00

Files, Streams, and Data Processing Guidelines

Basic principles

  • Prefer streaming and iterators over loading entire large files into memory.
  • Be explicit about encodings: pass encoding= whenever you open a text file, and default to UTF-8 unless the data source specifies otherwise.
  • Use context managers for all resources (with open(...) as f:).
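
The three principles above can be combined in one small generator. This is a minimal sketch, not a prescribed helper: the function name iter_nonempty_lines and the choice to skip blank lines are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterator


def iter_nonempty_lines(path: Path, encoding: str = "utf-8") -> Iterator[str]:
    """Yield stripped, non-empty lines one at a time (hypothetical helper)."""
    # The context manager closes the file when the generator finishes
    # or is closed, including when an exception interrupts iteration.
    with open(path, encoding=encoding) as f:
        # File objects iterate lazily, so only one line is held at a time.
        for line in f:
            line = line.strip()
            if line:
                yield line
```

A caller can then write for line in iter_nonempty_lines(Path("input.txt")): ... and memory use stays roughly constant regardless of file size.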

Large files and performance

  • For large text/binary files:
    • process in chunks or line-by-line
    • consider mmap when you need random access or searching within a file without reading it all into memory.
  • Avoid unnecessary copies of large data structures.
  • For data processing, consider columnar formats (e.g. Parquet) when appropriate.
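
As a rough illustration of chunked processing and of mmap, the sketch below hashes a file in fixed-size chunks and searches a file for a byte sequence through a memory map. The function names and the 1 MiB chunk size are assumptions chosen for the example, not recommendations from this guide.

```python
import hashlib
import mmap
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # 1 MiB; an arbitrary illustrative value


def sha256_of_file(path: Path, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a file without holding more than one chunk in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # read() returns b"" at end of file, which ends the loop.
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def find_bytes(path: Path, needle: bytes) -> int:
    """Return the offset of the first occurrence of needle, or -1."""
    if path.stat().st_size == 0:
        return -1  # an empty file cannot be memory-mapped
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle)
```

The chunked loop suits sequential passes, while the memory map is more convenient when the access pattern jumps around or when an existing bytes-like API (find, slicing) does the job.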

Safety and atomicity

  • For writes that must not corrupt data:
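    • write to a temporary file in the same directory as the target
    • flush and fsync it if the data must survive a crash or power loss
    • then atomically rename it over the target (os.replace in Python).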
  • Validate paths and avoid directory traversal vulnerabilities when working with user-supplied paths.
  • Handle missing directories gracefully (create them when sensible, or fail with a clear error).
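
The write-then-rename pattern and a basic traversal check might look like the following. This is a sketch under stated assumptions: atomic_write_text and resolve_inside are hypothetical helper names, creating the parent directory is one possible policy, and Path.is_relative_to requires Python 3.9 or later.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(target: Path, text: str, encoding: str = "utf-8") -> None:
    """Replace target with text so readers never see a partial file."""
    # Policy assumption: create missing parent directories.
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target so the rename stays on
    # one filesystem, which is what makes it atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding=encoding) as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # push data to disk before renaming
        os.replace(tmp_name, target)  # overwrites an existing target
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def resolve_inside(base_dir: Path, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = base_dir.resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

Note that tempfile.mkstemp creates the file readable and writable only by the owner, so the target ends up with those permissions; adjust them before the rename if the file needs to be shared.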

Formats and parsing

  • Prefer standard libraries (json, csv, pathlib) where possible.
  • When using third-party libraries (e.g. pyyaml), use their safe loading functions (yaml.safe_load rather than yaml.load) for untrusted input.
  • Clearly define schemas (via pydantic or dataclasses) when reading structured data, and validate input against them at the point of parsing.
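
To illustrate a schema defined with the standard library alone, the sketch below parses JSON into a dataclass and checks field types by hand. The User record and its fields are hypothetical; pydantic would perform equivalent validation declaratively.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class User:  # hypothetical schema, for illustration only
    name: str
    age: int


def parse_user(raw: str) -> User:
    """Parse a JSON object into a User, raising ValueError on bad input."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    name = data.get("name")
    age = data.get("age")
    if not isinstance(name, str):
        raise ValueError("'name' must be a string")
    # bool is a subclass of int, so reject it explicitly.
    if not isinstance(age, int) or isinstance(age, bool):
        raise ValueError("'age' must be an integer")
    return User(name=name, age=age)
```

Dataclasses do not validate types on their own, which is why the checks are written out; downstream code can then rely on a User always being well-formed.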

Cross-platform behavior

  • Use pathlib instead of manual string path manipulation.
  • Be mindful of line endings, file permissions, and case sensitivity across OSes.
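
A few of these cross-platform points can be shown with pathlib and the newline parameter of open. The helper names and the reports/<year>/<name>.csv layout below are assumptions made up for the example.

```python
from pathlib import Path, PureWindowsPath


def report_path(base: Path, year: int, name: str) -> Path:
    """Build a path with the / operator instead of string concatenation."""
    return base / "reports" / str(year) / f"{name}.csv"


def write_lf(path: Path, text: str) -> None:
    """Write text with LF line endings on every platform."""
    # newline="\n" disables the default translation of "\n" to os.linesep.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)


def same_name_on_windows(a: str, b: str) -> bool:
    """Check whether two names would collide under Windows path rules."""
    # PureWindowsPath compares case-insensitively; PurePosixPath does not.
    return PureWindowsPath(a) == PureWindowsPath(b)
```

The pure path classes can be used on any OS, which makes it possible to test for Windows-style name collisions from a POSIX machine.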