# Files, Streams, and Data Processing Guidelines

## Basic principles

- Prefer streaming and iterators over loading entire large files into memory.
- Be explicit about encodings; default to UTF-8 when reasonable.
- Use context managers for all resources (`with open(...) as f:`).

## Large files and performance

- For large text/binary files:
  - process in chunks or line-by-line
  - consider `mmap` for specific use cases where it simplifies access patterns.
- Avoid unnecessary copies of large data structures.
- For data processing, consider columnar formats (e.g. Parquet) when appropriate.

## Safety and atomicity

- For writes that must not corrupt data:
  - write to a temporary file
  - fsync if necessary
  - then atomically rename.
- Validate paths and avoid directory traversal vulnerabilities when working with user-supplied paths.
- Handle missing directories gracefully (create them when sensible, or fail with a clear error).

## Formats and parsing

- Prefer standard libraries (`json`, `csv`, `pathlib`) where possible.
- When using third-party libraries (e.g. `pyyaml`), use safe loading functions.
- Clearly define schemas (via `pydantic` or dataclasses) when reading structured data.

## Cross-platform behavior

- Use `pathlib` instead of manual string path manipulation.
- Be mindful of line endings, file permissions, and case sensitivity across OSes.
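
The write-to-temp, fsync, rename sequence and the path validation advice under "Safety and atomicity" can be sketched roughly as follows. This is a minimal illustration and not a definitive implementation: the helper names `resolve_under` and `atomic_write_text` are hypothetical, the code assumes Python 3.9+ (for `Path.is_relative_to`), and it assumes the temporary file can live in the same directory as the target so that the rename stays on one filesystem.

```python
import os
import tempfile
from pathlib import Path


def resolve_under(base_dir: Path, user_path: str) -> Path:
    """Resolve user_path inside base_dir; reject anything that escapes it.

    Hypothetical helper: the name and signature are illustrative only.
    """
    base = base_dir.resolve()
    # Joining with an absolute user_path discards base, so the check below
    # also rejects absolute paths that point outside base_dir.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):  # requires Python 3.9+
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate


def atomic_write_text(target: Path, text: str, encoding: str = "utf-8") -> None:
    """Write text to target so readers see the old or the new content, never a partial file.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Create missing parent directories rather than failing later.
    target.parent.mkdir(parents=True, exist_ok=True)

    # The temp file is created next to the target so the final rename
    # does not cross a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding=encoding) as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_name, target)  # replaces an existing target in one step
    except BaseException:
        # Remove the leftover temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A call might look like `atomic_write_text(resolve_under(Path("data"), "out/report.txt"), "hello\n")`, where `data` and `out/report.txt` are placeholder paths. Whether the `fsync` is worth its cost depends on how much durability the data needs; for full durability across a crash, some systems also require syncing the containing directory, which this sketch omits.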