############
Contributing
############

Our project welcomes new contributors who want to help us fix bugs and improve our scrapers. The current status of our effort is documented in our `sources list <./sources.html>`_ and `issue tracker <https://github.com/biglocalnews/warn-scraper/issues>`_. We want your help. We need your help. Here's how to get started.

Adding features and fixing bugs is managed using GitHub’s `pull request <https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests>`_ system.

The tutorial that follows assumes you have the `Python <https://www.python.org/>`_ programming language, the `pipenv <https://pipenv.pypa.io/>`_ package manager and the `git <https://git-scm.com/>`_ version control system already installed. If you don't, you'll want to address that first.

.. contents:: The typical workflow
    :depth: 2
    :local:

Create a fork
#############

Start by visiting our project’s repository at `github.com/biglocalnews/warn-scraper <https://github.com/biglocalnews/warn-scraper>`_ and creating a fork. You can learn how from `GitHub’s documentation <https://docs.github.com/en/get-started/quickstart/fork-a-repo>`_.

Clone the fork
##############

Now you need to make a copy of your fork on your computer using GitHub’s cloning system. There are several methods to do this, which are covered in the `GitHub documentation <https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository>`_.

A typical terminal command will look something like the following, with your username inserted in the URL.

.. code-block:: bash

    git clone git@github.com:yourusername/warn-scraper.git

Install dependencies
####################

You should `change directory <https://manpages.ubuntu.com/manpages/trusty/man1/cd.1posix.html>`_ into the folder where your code was downloaded.

.. code-block:: bash

    cd warn-scraper

The `pipenv` package manager can install all of the Python tools necessary to run and test our code.

.. code-block:: bash

    pipenv install --dev

Now install `pre-commit` to run a battery of automatic quick fixes against your work.

.. code-block:: bash

    pipenv run pre-commit install

Create an issue
###############

Before you begin coding, you should visit our `issue tracker <https://github.com/biglocalnews/warn-scraper/issues>`_ and create a new ticket. You should try to clearly and succinctly describe the problem you are trying to solve. If you haven't done it before, GitHub has a guide on `how to create an issue <https://docs.github.com/en/issues/tracking-your-work-with-issues/creating-an-issue>`_.

Create a branch
###############

Next we will `create a branch <https://www.w3schools.com/git/git_branch.asp>`_ in your local repository where you can work without disturbing the mainline of code.

You can do this by running a line of code like the one below. You should substitute a branch name that summarizes the work you're trying to do.

.. code-block:: bash

    git checkout -b your-branch-name

For instance, if you were trying to add a scraper for the state of Iowa, you might name your branch `ia-scraper`.

.. code-block:: bash

    git checkout -b ia-scraper

We ask that you follow a pattern where the branch name includes the postal code of the state you're working on, combined with the issue number generated by GitHub. In this example, the branch covers work on the New Jersey scraper tracked in GitHub issue #100.

.. code-block:: bash

    git checkout -b nj-100
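
If it helps to see the naming pattern spelled out, here is a minimal Python sketch. The `branch_name` function is hypothetical and is not part of this project. It only illustrates how the postal code and the issue number combine.

.. code-block:: python

    import re


    def branch_name(postal_code: str, issue_number: int) -> str:
        """Combine a two-letter postal code and an issue number, e.g. nj-100."""
        # Hypothetical helper for illustration only
        if not re.fullmatch(r"[A-Za-z]{2}", postal_code):
            raise ValueError(f"Expected a two-letter postal code, got {postal_code!r}")
        if issue_number < 1:
            raise ValueError("Issue numbers start at 1")
        # Branch names use the lower-case postal code, a hyphen and the issue number
        return f"{postal_code.lower()}-{issue_number}"


    print(branch_name("NJ", 100))  # prints nj-100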

Perform your work
#################

Now you can begin your work. You can start editing the code on your computer, making changes and running scripts to iterate toward your goal.

Creating a new scraper
----------------------

When adding a new state, you should create a new Python file in the `warn/scrapers` directory named with the state's postal code. Here is an example of a starting point you can paste in to get going.

.. code-block:: python

    from pathlib import Path

    from .. import utils
    from ..cache import Cache


    def scrape(
        data_dir: Path = utils.WARN_DATA_DIR,
        cache_dir: Path = utils.WARN_CACHE_DIR,
    ) -> Path:
        """
        Scrape data from Iowa.

        Keyword arguments:
        data_dir -- the Path where the result will be saved (default WARN_DATA_DIR)
        cache_dir -- the Path where results can be cached (default WARN_CACHE_DIR)

        Returns: the Path where the file is written
        """
        # Grab the page
        page = utils.get_url("https://xx.gov/yy.html")
        html = page.text

        # Write the raw file to the cache
        cache = Cache(cache_dir)
        cache.write("xx/yy.html", html)

        # Parse the source file and convert to a list of rows, with a header in the first row.
        # It's up to you to fill in the blank here based on the structure of the source file.
        # You could do that here with BeautifulSoup or whatever other technique.
        cleaned_data: list = []  # Placeholder: replace with the rows you parse

        # Set the path to the final CSV
        # We should always use the lower-case state postal code, like nj.csv
        output_csv = data_dir / "xx.csv"

        # Write out the rows to the export directory
        utils.write_rows_to_csv(output_csv, cleaned_data)

        # Return the path to the final CSV
        return output_csv


    if __name__ == "__main__":
        scrape()
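
The parsing step left blank in the template depends on the source. As one illustration, assuming the state publishes its notices in a plain HTML table, the sketch below uses only Python's standard library to turn a table into a list of rows with the header first. The `TableParser` class, the `parse_table` function and the sample HTML are all hypothetical. A real scraper might use BeautifulSoup or another technique instead.

.. code-block:: python

    from html.parser import HTMLParser


    class TableParser(HTMLParser):
        """Collect the text of every table cell, grouped by row."""

        def __init__(self):
            super().__init__()
            self.rows = []  # Completed rows, each a list of cell strings
            self._row = None  # The row being built, or None outside a <tr>
            self._cell = None  # Text fragments of the open cell, or None

        def handle_starttag(self, tag, attrs):
            if tag == "tr":
                self._row = []
            elif tag in ("td", "th") and self._row is not None:
                self._cell = []

        def handle_data(self, data):
            # Only keep text that falls inside a table cell
            if self._cell is not None:
                self._cell.append(data)

        def handle_endtag(self, tag):
            if tag in ("td", "th") and self._cell is not None:
                self._row.append("".join(self._cell).strip())
                self._cell = None
            elif tag == "tr" and self._row is not None:
                self.rows.append(self._row)
                self._row = None


    def parse_table(html: str) -> list:
        """Return the table's rows as lists of strings, header row first."""
        parser = TableParser()
        parser.feed(html)
        parser.close()
        return parser.rows


    # Made-up sample input standing in for a downloaded page
    sample_html = """
    <table>
      <tr><th>Company</th><th>Notice date</th></tr>
      <tr><td>Acme Corp.</td><td>2022-01-15</td></tr>
    </table>
    """
    cleaned_data = parse_table(sample_html)

The resulting `cleaned_data` has the list-of-rows shape described in the template's comments, with the header in the first row.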

When creating a scraper, there are a few rules of thumb.

1. The raw data being scraped — whether it be HTML, CSV or PDF — should be saved to the cache unedited. We aim to store pristine versions of our source data.

2. The data extracted from source files should be exported as a single file.  Any intermediate files generated during data processing should not be written to the data folder. Such files should be written to the cache directory.

3. The final export should be named with the state's postal code, in lower case. For example, Iowa's final file should be saved as `ia.csv`.

4. For simple cases, use a cache name identical to the final export name.

5. If many files need to be cached, create a subdirectory using the lower-case state postal code and apply a sensible naming scheme to the cached files (e.g. `mo/page_1.html`).

Here's an example directory demonstrating the above conventions:

.. code-block:: bash

    ├── cache
    │   ├── mo.csv
    │   ├── nj
    │   │   ├── Jan2010Warn.html
    │   │   └── Jan2011Warn.html
    │   ├── ny_raw_1.csv
    │   └── ny_raw_2.csv
    └── exports
        ├── mo.csv
        ├── nj.csv
        └── ny.csv
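
The naming rules above could be captured in a pair of small helpers. This is only a sketch: `export_path` and `cache_key` are hypothetical names invented for this illustration, not functions provided by the project.

.. code-block:: python

    from pathlib import Path


    def export_path(data_dir: Path, postal_code: str) -> Path:
        """Return the final CSV path, named with the lower-case postal code."""
        return data_dir / f"{postal_code.lower()}.csv"


    def cache_key(postal_code: str, filename: str = "") -> str:
        """Return a cache name that follows the conventions above.

        With no filename, the simple case applies and the cache name
        matches the export name. Otherwise the file is placed in a
        subdirectory named with the lower-case postal code.
        """
        state = postal_code.lower()
        if not filename:
            return f"{state}.csv"
        return f"{state}/{filename}"


    print(export_path(Path("exports"), "IA").as_posix())  # prints exports/ia.csv
    print(cache_key("MO"))  # prints mo.csv
    print(cache_key("MO", "page_1.html"))  # prints mo/page_1.html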

A fully featured example to learn from is `Chris Zubak-Skees’ scraper for Georgia <https://github.com/biglocalnews/warn-scraper/blob/main/warn/scrapers/ga.py>`_.

Running the CLI
---------------

After a scraper has been created, the command-line tool provides a method to test code changes as you go. Run the following, and you'll see the standard help message.

.. code-block:: bash

    pipenv run python -m warn.cli --help

    Usage: python -m warn.cli [OPTIONS] [SCRAPERS]...

      Command-line interface for downloading WARN Act notices.

      SCRAPERS -- a list of one or more postal codes to scrape. Pass `all` to
      scrape all supported states and territories.

    Options:
      --data-dir PATH                 The Path were the results will be saved
      --cache-dir PATH                The Path where results can be cached
      --delete / --no-delete          Delete generated files from the cache
      -l, --log-level [DEBUG|INFO|WARNING|ERROR|CRITICAL]
                                      Set the logging level
      --help                          Show this message and exit.

Running a state is as simple as passing arguments to that same command. If you were trying to develop an Iowa scraper found in the `warn/scrapers/ia.py` file, you could run something like this.

.. code-block:: bash

    pipenv run python -m warn.cli IA

For more verbose logging, you can ask the system to show debugging information.

.. code-block:: bash

    pipenv run python -m warn.cli IA -l DEBUG

You could continue to iterate with code edits and CLI runs until you've completed your goal.

Run tests
#########

Before you submit your work for inclusion in the project, you should run our tests to identify bugs. Testing is implemented via pytest. Run the tests with the following.

.. code-block:: bash

    make test

If any errors arise, carefully read the traceback message to determine what needs to be repaired.

Push to your fork
#################

Once you're happy with your work and the tests are passing, you should commit your work and push it to your fork.

.. code-block:: bash

    git commit -am "Describe your work here"
    git push -u origin your-branch-name

If there have been significant changes to the main branch since you started work, you should consider integrating those edits into your branch, since any differences will need to be reconciled before your code can be merged.

.. code-block:: bash

    # Checkout and pull updates on main
    git checkout main
    git pull

    # Checkout your branch again
    git checkout your-branch-name

    # Rebase your changes on top of main
    git rebase main

If any `code conflicts <https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/addressing-merge-conflicts/about-merge-conflicts>`_ arise, you can open the listed files and seek to reconcile them yourself. If you need help, reach out to the maintainers.

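Once that's complete, push again to your fork's branch. Because a rebase rewrites your branch's history, a branch you have already pushed will need to be force-pushed.

.. code-block:: bash

    # Overwrite the remote branch only if nobody else has updated it
    git push --force-with-lease origin your-branch-name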

Send a pull request
###################

The final step is to submit a `pull request <https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests>`_ to the main repository, asking the maintainers to consider integrating your patch. GitHub has `a short guide <https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request>`_ that can walk you through the process. You should tag your issue number in the request so that they are linked in GitHub’s system.