// SPDX-FileCopyrightText: 2017 P Burgess for Adafruit Industries
//
// SPDX-License-Identifier: MIT

/*!
 * @file Adafruit_NeoPXL8.cpp
 *
 * @mainpage 8-way concurrent DMA NeoPixel library for SAMD21, SAMD51,
 * RP2040, RP235x, and ESP32S3 microcontrollers.
 *
 * @section intro_sec Introduction
 *
 * Adafruit_NeoPXL8 is an Arduino library that leverages hardware features
 * unique to some Atmel SAMD21 and SAMD51 microcontrollers, plus the
 * Raspberry Pi RP2040, RP235x, and Espressif ESP32S3 (not S2, etc.) chips, to
 * communicate with large numbers of NeoPixels with very low CPU utilization
 * and without losing track of time. It was originally designed for the
 * Adafruit Feather M0 board with NeoPXL8 FeatherWing interface/adapter,
 * but may be applicable to other situations (e.g. Arduino Zero, Adafruit
 * Metro M0, etc., using logic level-shifting as necessary). A different
 * FeatherWing with pinout specific to the Feather M4 is also available.
 *
 * NeoPXL8 FeatherWing M0: https://www.adafruit.com/product/3249
 * NeoPXL8 FeatherWing M4: https://www.adafruit.com/product/4537
 *
 * Because the SAMD21 does not provide GPIO DMA, the code instead makes use
 * of the "pattern generator" peripheral for its 8 concurrent outputs.
 * Due to pin/peripheral multiplexing constraints, most outputs are limited
 * to SPECIFIC PINS or provide at most ONE ALTERNATE pin selection. See the
 * example code for details. The payoff is that this peripheral handles the
 * NeoPixel data transfer while the CPU is entirely free to render the next
 * frame (and interrupts can remain enabled -- millis()/micros() don't lose
 * time, and soft PWM (for servos, etc.) still operates normally).
 *
 * Additionally, NeoPXL8 has nondestructive brightness scaling...unlike
 * classic NeoPixel, getPixelColor() here always returns the original value
 * as was passed to setPixelColor().
 *
 * Adafruit_NeoPXL8HDR is a subclass of Adafruit_NeoPXL8 adding 16-bits-
 * per-channel color support, temporal dithering, frame blending and
 * gamma correction. This requires inordinate RAM, and the frequent need
 * for refreshing makes it best suited for multi-core chips (e.g. RP2040).
 *
 * RP2040 and RP235x support requires Philhower core (not Arduino mbed core).
 * Also on RP2040 and RP235x, pin numbers passed to constructor are GP##
 * indices, not necessarily the digital pin numbers silkscreened on the board.
 *
 * 0/1 bit timing does not precisely match NeoPixel/WS2812/SK6812 datasheet
 * specs, but it seems to work well enough. Use at your own peril.
 *
 * Some of the more esoteric NeoPixel functions are not implemented here, so
 * THIS IS NOT A 100% DROP-IN REPLACEMENT for all NeoPixel code right now.
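 *
 * A minimal usage sketch, based on the constructor and begin()/show()
 * signatures in this file. The strand length (NUM_LEDS) and the color are
 * illustrative assumptions, not recommendations. Pixels are addressed as
 * one long strip: strand s, pixel i lives at index (s * NUM_LEDS + i).
 *
 * @code{.cpp}
 * #include <Adafruit_NeoPXL8.h>
 *
 * #define NUM_LEDS 60 // Pixels PER STRAND (hypothetical value)
 *
 * // NULL selects the board's default pin list (NEOPXL8_DEFAULT_PINS)
 * Adafruit_NeoPXL8 leds(NUM_LEDS, NULL, NEO_GRB);
 *
 * void setup() {
 *   if (!leds.begin(false)) { // false = single-buffered DMA
 *     for (;;) {}             // Allocation or pin setup failed; halt
 *   }
 *   for (uint8_t s = 0; s < 8; s++) { // First pixel of each strand...
 *     leds.setPixelColor(s * NUM_LEDS, leds.Color(32, 0, 0)); // ...dim red
 *   }
 *   leds.show(); // Stage the data and start the DMA transfer
 * }
 *
 * void loop() {}
 * @endcode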
 *
 * Adafruit invests time and resources providing this open source code,
 * please support Adafruit and open-source hardware by purchasing
 * products from Adafruit!
 *
 * @section dependencies Dependencies
 *
 * This library depends on
 * <a href="https://github.com/adafruit/Adafruit_NeoPixel">Adafruit_NeoPixel</a>
 * and (for SAMD chips)
 * <a href="https://github.com/adafruit/Adafruit_ZeroDMA">Adafruit_ZeroDMA</a>
 * being present on your system. Please make sure you have installed the
 * latest versions before using this library.
 *
 * @section author Author
 *
 * Written by Phil "Paint Your Dragon" Burgess for Adafruit Industries.
 *
 * @section license License
 *
 * MIT license, all text here must be included in any redistribution.
 *
 */

#include "Adafruit_NeoPXL8.h"
#include "wiring_private.h" // pinPeripheral() function

// SAMD DMA transfer using TCC0 as beat clock seems to stutter on the first
// few elements out, which can botch the delicate NeoPixel timing. A few
// initial zero bytes are issued to give DMA time to stabilize. The number
// of bytes here was determined empirically.
#define EXTRASTARTBYTES 24 ///< Empty bytes issued until DMA timing solidifies
// Not a perfect solution and you might still see infrequent glitches,
// especially on the first pixel of a strand. Often this is just a matter of
// logic levels -- SAMD is a 3.3V device, while NeoPixels want 5V logic --
// so either use a logic level shifter, or simply power the NeoPixels at a
// slightly lower voltage (e.g. 4.5V). It may also be due to the 1:3 bit
// timing generated by this code (close but doesn't exactly match the
// NeoPixel spec)...usually only affects the 1st pixel, subsequent pixels OK
// due to signal reshaping through the 1st.

static const int8_t defaultPins[] = NEOPXL8_DEFAULT_PINS;
static volatile bool sending = 0;     // Set while DMA transfer is active
static volatile uint32_t lastBitTime; // micros() when last bit issued

// NEOPXL8 CLASS -----------------------------------------------------------

Adafruit_NeoPXL8::Adafruit_NeoPXL8(uint16_t n, int8_t *p, neoPixelType t)
    : Adafruit_NeoPixel(n * 8, -1, t), brightness(256) {
  memcpy(pins, p ? p : defaultPins, sizeof(pins));
}

// A couple elements of the NeoPXL8 struct must be accessed in the DMA IRQ,
// which is outside the class. A pointer to the active NeoPXL8 is kept, so
// we can call a member function (also gets us around some protected access).
// This does mean only a single NeoPXL8 can be active at a time (as was
// already the case on SAMD).
static Adafruit_NeoPXL8 *neopxl8_ptr = NULL;

#if defined(ARDUINO_ARCH_RP2040)
// note that ARDUINO_ARCH_RP2040 blocks also apply to RP235x

#define DMA_IRQ_N 1 ///< Can be 0 or 1, no functional difference, 1 looks cool

// PIO code. One pass through the program takes 9 cycles (2 + 1 + 2 + 4).
// Outputs are high for 2 of those cycles on a '0' bit (instruction 2) and
// 6 on a '1' bit (instructions 2 and 3), i.e. 2/9 and 6/9 duty cycles.
// This does not match the datasheet, but works well enough (actual
// NeoPixel output doesn't match the datasheet either, there's ample
// slop). If this proves problematic, the delay values can be tweaked and
// the total cycles factored into the value passed to
// sm_config_set_clkdiv() later.
static const uint16_t neopxl8_opcodes[] = {
    //             .wrap_target
    0xA103, // 0: mov  pins, null  [1]  Write 8 parallel '0' bits, delay 1
    0x80A0, // 1: pull block            Wait on next byte from TX FIFO to OSR
    0xA10B, // 2: mov  pins, !null [1]  Write 8 parallel '1' bits, delay 1
    0xA307, // 3: mov  pins, osr   [3]  Write 8 parallel data bits, delay 3
            //     .wrap
};

static const struct pio_program neopxl8_program = {
    .instructions = neopxl8_opcodes,
    .length = sizeof neopxl8_opcodes / sizeof neopxl8_opcodes[0],
    .origin = -1,
};

// Called at end of DMA transfer. Clears 'sending' flag and notes start of
// NeoPixel latch time. Done as a callback (from the IRQ below) because it
// needs access to a protected NeoPXL8 member (dma_channel).
void Adafruit_NeoPXL8::dma_callback() {
  if (dma_irqn_get_channel_status(DMA_IRQ_N, dma_channel)) {
    dma_irqn_acknowledge_channel(DMA_IRQ_N, dma_channel); // Clear IRQ
    lastBitTime = micros();
    sending = 0;
  }
}

static void dma_finish_irq(void) {
  if (neopxl8_ptr) {
    neopxl8_ptr->dma_callback();
  }
}

#elif defined(CONFIG_IDF_TARGET_ESP32S3)

// Callback for end-of-DMA-transfer
static IRAM_ATTR bool dma_callback(gdma_channel_handle_t dma_chan,
                                   gdma_event_data_t *event_data,
                                   void *user_data) {
  // DMA callback seems to occur a moment before the last data has issued
  // (perhaps buffering between DMA and the LCD peripheral?), so pause a
  // moment before clearing the lcd_start flag. This figure was determined
  // empirically, not science...may need to increase if last-pixel trouble.
  esp_rom_delay_us(5);
  LCD_CAM.lcd_user.lcd_start = 0;
  // lastBitTime is NOT set in the callback because it would periodically
  // have a 'too early' value. Instead, it's set in the show() function
  // after the lcd_start flag is clear...which shouldn't make a difference,
  // but does. The result is that it's periodically 'too late' in that
  // case...but this just results in an infrequent slightly-long latch,
  // rather than a too-short one that could cause refresh problems.
  return true;
}

// Compatibility wrapper for GPIO configuration across ESP-IDF versions.
// In ESP-IDF v5.x, gpio_hal_iomux_func_sel() was removed and replaced with
// gpio_ll_func_sel(). This wrapper provides backward compatibility while
// supporting the new API.
#if defined(ESP_IDF_VERSION_MAJOR) && (ESP_IDF_VERSION_MAJOR >= 5)
static inline void _np8_set_pin_gpio(gpio_num_t gpio_num) {
  gpio_ll_func_sel(&GPIO, gpio_num, PIN_FUNC_GPIO);
}
#else
static inline void _np8_set_pin_gpio(gpio_num_t gpio_num) {
  gpio_hal_iomux_func_sel(GPIO_PIN_MUX_REG[gpio_num], PIN_FUNC_GPIO);
}
#endif

#else // SAMD

// This table holds PORTs, bits and peripheral selects of valid pattern
// generator waveform outputs. This data is not in the Arduino variant
// header...was derived from the SAM D21E/G/J datasheet. Some of these
// PORT/pin combos are NOT present on some dev boards or SAMD21 variants.
// If a pin is NOT in this list, it just means there's no TCC0/WO[n] func
// there, but it may still exist and have other peripheral functions.
static struct {
  EPortType port;      // PORTA|PORTB
  uint8_t bit;         // Port bit (0-31)
  uint8_t wo;          // TCC0/WO# (0-7)
  EPioType peripheral; // Peripheral to select for TCC0 out
} tcc0pinMap[] = {
#ifdef __SAMD51__
    {PORTA, 8, 0, PIO_TIMER_ALT},  // FLASH_IO0 on Metro M4
    {PORTA, 9, 1, PIO_TIMER_ALT},  // FLASH_IO1
    {PORTA, 10, 2, PIO_TIMER_ALT}, // FLASH_IO2
    {PORTA, 11, 3, PIO_TIMER_ALT}, // FLASH_IO3
    {PORTA, 12, 6, PIO_TIMER_ALT}, // MOSI   PCC/DEN1  NOT WORKING?
    {PORTA, 13, 7, PIO_TIMER_ALT}, // SCK    PCC/DEN2
    // PORTA 14 has no TCC0 function (MISO, PCC/CLK; PIO_COM = peripheral G)
    {PORTA, 16, 4, PIO_TCC_PDEC},  // D13    PCC[0]
    {PORTA, 17, 5, PIO_TCC_PDEC},  // D12    PCC[1]
    {PORTA, 18, 6, PIO_TCC_PDEC},  // D10    PCC[2]
    {PORTA, 19, 7, PIO_TCC_PDEC},  // D11    PCC[3]
    {PORTA, 20, 0, PIO_TCC_PDEC},  // D9     PCC[4]
    {PORTA, 21, 1, PIO_TCC_PDEC},  // D8     PCC[5]
    {PORTA, 22, 2, PIO_TCC_PDEC},  // D0     PCC[6]
    {PORTA, 23, 3, PIO_TCC_PDEC},  // D1     PCC[7]
    {PORTB, 10, 4, PIO_TIMER_ALT}, // FLASH_SCK
    {PORTB, 11, 5, PIO_TIMER_ALT}, // FLASH_CS
    {PORTB, 12, 0, PIO_TCC_PDEC},  // D7
    {PORTB, 13, 1, PIO_TCC_PDEC},  // D4
    {PORTB, 14, 2, PIO_TCC_PDEC},  // D5     PCC[8]
    {PORTB, 15, 3, PIO_TCC_PDEC},  // D6     PCC[9]
    {PORTB, 16, 4, PIO_TCC_PDEC},  // D3
    {PORTB, 17, 5, PIO_TCC_PDEC},  // D2
    {PORTB, 30, 6, PIO_TCC_PDEC},  // SWO
    {PORTB, 31, 7, PIO_TCC_PDEC},  // NC
#else
    {PORTA, 4, 0, PIO_TIMER},      // A3 on Metro M0
    {PORTA, 5, 1, PIO_TIMER},      // A4
    {PORTA, 8, 0, PIO_TIMER},      // D4
    {PORTA, 9, 1, PIO_TIMER},      // D3
    {PORTA, 10, 2, PIO_TIMER_ALT}, // D1
    {PORTA, 11, 3, PIO_TIMER_ALT}, // D0
    {PORTA, 12, 6, PIO_TIMER_ALT}, // MISO
    {PORTA, 13, 7, PIO_TIMER_ALT}, // FLASH_CS
    {PORTA, 14, 4, PIO_TIMER_ALT}, // D2
    {PORTA, 15, 5, PIO_TIMER_ALT}, // D5
    {PORTA, 16, 6, PIO_TIMER_ALT}, // D11 (TCC func not in Rev A silicon)
    {PORTA, 17, 7, PIO_TIMER_ALT}, // D13 (TCC func not in Rev A silicon)
    {PORTA, 18, 2, PIO_TIMER_ALT}, // D10
    {PORTA, 19, 3, PIO_TIMER_ALT}, // D12
    {PORTA, 20, 6, PIO_TIMER_ALT}, // D6
    {PORTA, 21, 7, PIO_TIMER_ALT}, // D7
    {PORTA, 22, 4, PIO_TIMER_ALT}, // SDA
    {PORTA, 23, 5, PIO_TIMER_ALT}, // SCL
    {PORTB, 10, 4, PIO_TIMER_ALT}, // MOSI
    {PORTB, 11, 5, PIO_TIMER_ALT}, // SCK
    {PORTB, 12, 6, PIO_TIMER_ALT}, // NC
    {PORTB, 13, 7, PIO_TIMER_ALT}, // NC
    {PORTB, 16, 4, PIO_TIMER_ALT}, // NC
    {PORTB, 17, 5, PIO_TIMER_ALT}, // NC
    {PORTB, 30, 0, PIO_TIMER},     // NC
    {PORTB, 31, 1, PIO_TIMER}      // NC
#endif
};
#define PINMAPSIZE                                                             \
  (sizeof(tcc0pinMap) /                                                        \
   sizeof(tcc0pinMap[0])) ///< Number of elements in the tcc0pinMap[] array

// Given a pin number, locate corresponding entry in the pin map table
// above, configure as a pattern generator output and return bitmask
// for later data conversion (returns 0 if invalid pin).
static uint8_t configurePin(int8_t pin) {
  if ((pin >= 0) && (pin < PINS_COUNT)) {
    EPortType port = g_APinDescription[pin].ulPort;
    uint8_t bit = g_APinDescription[pin].ulPin;
    for (uint8_t i = 0; i < PINMAPSIZE; i++) {
      if ((port == tcc0pinMap[i].port) && (bit == tcc0pinMap[i].bit)) {
        pinPeripheral(pin, tcc0pinMap[i].peripheral);
        return (1 << tcc0pinMap[i].wo);
      }
    }
  }
  return 0;
}
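
// Worked example, going by the Metro M0 pin labels in the SAMD21 half of
// the table above (an assumption; other boards map Arduino pin numbers
// differently): D4 is PORTA bit 8, listed as TCC0/WO[0], so
// configurePin(4) selects PIO_TIMER on that pin and returns
// (1 << 0) = 0x01. D2 is PORTA bit 14, TCC0/WO[4], so configurePin(2)
// returns (1 << 4) = 0x10. A pin with no table entry yields 0 and is
// simply left out of the pattern generator enable mask.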

// Called at end of DMA transfer. Clears 'sending' flag and notes
// start-of-NeoPixel-latch time.
static void dmaCallback(Adafruit_ZeroDMA *dma) {
  lastBitTime = micros();
  sending = 0;
}

#endif // end SAMD

Adafruit_NeoPXL8::~Adafruit_NeoPXL8() {
#if defined(ARDUINO_ARCH_RP2040)
  pio_sm_set_enabled(pio, sm, false);
  pio_remove_program(pio, &neopxl8_program, offset);
  pio_sm_unclaim(pio, sm);
  dma_channel_abort(dma_channel);
  dma_channel_unclaim(dma_channel);
  if (dmaBuf[0])
    free(dmaBuf[0]);
  irq_remove_handler(DMA_IRQ_N == 0 ? DMA_IRQ_0 : DMA_IRQ_1, dma_finish_irq);
#elif defined(CONFIG_IDF_TARGET_ESP32S3)
  gdma_reset(dma_chan);
  if (allocAddr)
    heap_caps_free(allocAddr);
#else
  dma.abort();
  if (allocAddr)
    free(allocAddr);
#endif
  neopxl8_ptr = NULL;
}

bool Adafruit_NeoPXL8::begin(bool dbuf) {
  Adafruit_NeoPixel::begin(); // Call base class begin() function 1st
  if (pixels) {               // Successful malloc of NeoPixel buffer?
    uint8_t bytesPerPixel = (wOffset == rOffset) ? 3 : 4;

    memset(bitmask, 0, sizeof(bitmask));

    neopxl8_ptr = this; // Save object pointer for interrupt

#if defined(ARDUINO_ARCH_RP2040)
    // Validate pins, must be within any 8 consecutive GPIO bits
    int16_t least_pin = 0x7FFF, most_pin = -1;
    for (uint8_t i = 0; i < 8; i++) {
      if (pins[i] >= 0) {
        least_pin = min(least_pin, pins[i]);
        most_pin = max(most_pin, pins[i]);
      }
    }
    if (abs(most_pin - least_pin) > 7) {
      return false;
    }

    uint32_t buf_size = numLEDs * bytesPerPixel;
    uint32_t alloc_size = dbuf ? buf_size * 2 : buf_size;

    if ((dmaBuf[0] = (uint8_t *)malloc(alloc_size))) {

      // If no double buffering, point both to same space
      dmaBuf[1] = dbuf ? &dmaBuf[0][buf_size] : dmaBuf[0];

      // Set up PIO code & clock
      // Find a PIO with enough available space in its instruction memory
      pio = NULL;

      if (!pio_claim_free_sm_and_add_program_for_gpio_range(
              &neopxl8_program, &pio, &sm, &offset, least_pin, 8, true)) {
        pio = NULL;
        sm = -1;
        offset = 0;
        return false; // No PIO available
      }

      // offset = pio_add_program(pio, &neopxl8_program);
      // sm = pio_claim_unused_sm(pio, true); // 0-3
      pio_sm_config conf = pio_get_default_sm_config();
      conf.pinctrl = 0; // SDK fails to set this
      sm_config_set_wrap(&conf, offset, offset + neopxl8_program.length - 1);
      sm_config_set_out_shift(&conf, true, false, 8);
      sm_config_set_out_pins(&conf, least_pin, 8);
      sm_config_set_in_shift(&conf, true, false, 8);
      sm_config_set_fifo_join(&conf, PIO_FIFO_JOIN_TX);
      float div = (float)F_CPU / 800000.0 / 9.0; // 9 = PIO cycles/bit
      sm_config_set_clkdiv(&conf, div);
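      // Worked example (assuming a 125 MHz system clock, which is only an
      // illustrative F_CPU value): 125000000 / 800000 / 9 = ~17.36, so
      // the state machine runs at 125 MHz / 17.36 = 7.2 MHz. Each PIO
      // cycle is then ~138.9 ns, and the 9 cycles of one pass through the
      // program add up to 1.25 us, i.e. one bit per output at 800 kHz.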
      pio_sm_init(pio, sm, offset, &conf);
      pio_sm_set_enabled(pio, sm, true);

      // Set up PIO outputs
      uint32_t pindir_mask = 0;
      for (uint8_t i = 0; i < 8; i++) {
        if (pins[i] >= 0) {
          pio_gpio_init(pio, pins[i]);
          gpio_set_drive_strength(pins[i], GPIO_DRIVE_STRENGTH_2MA);
          pindir_mask |= 1 << pins[i];
          bitmask[i] = 1 << (pins[i] - least_pin);
        }
      }
      // Func not working? Or using it wrong?
      // pio_sm_set_pindirs_with_mask(pio, sm, pindir_mask, pindir_mask);
      // For now, set all 8 as outputs, even if in-betweens are skipped
      pio_sm_set_consecutive_pindirs(pio, sm, least_pin, 8, true);

      // Set up DMA transfer
      dma_channel = dma_claim_unused_channel(false); // Don't panic

      dma_config = dma_channel_get_default_config(dma_channel);
      channel_config_set_transfer_data_size(&dma_config, DMA_SIZE_8);
      channel_config_set_read_increment(&dma_config, true);
      channel_config_set_write_increment(&dma_config, false);
      // Set DMA trigger
      channel_config_set_dreq(&dma_config, pio_get_dreq(pio, sm, true));
      dma_channel_configure(dma_channel, &dma_config,
                            &pio->txf[sm],      // dest
                            dmaBuf[dbuf_index], // src
                            buf_size, false);
      // Set up end-of-DMA interrupt
      irq_add_shared_handler(DMA_IRQ_N == 0 ? DMA_IRQ_0 : DMA_IRQ_1,
                             dma_finish_irq,
                             PICO_SHARED_IRQ_HANDLER_DEFAULT_ORDER_PRIORITY);
#if (DMA_IRQ_N == 0)
      dma_channel_set_irq0_enabled(dma_channel, true);
#else
      dma_channel_set_irq1_enabled(dma_channel, true);
#endif
      irq_set_enabled(DMA_IRQ_N == 0 ? DMA_IRQ_0 : DMA_IRQ_1, true);

      return true; // Success!
    }

#elif defined(CONFIG_IDF_TARGET_ESP32S3)

    uint32_t xfer_size = numLEDs * bytesPerPixel * 3;
    uint32_t buf_size = xfer_size + 3;        // +3 for long align
    int num_desc = (xfer_size + 4094) / 4095; // sic. (NOT 4096)
    uint32_t alloc_size =
        num_desc * sizeof(dma_descriptor_t) + (dbuf ? buf_size * 2 : buf_size);
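    // Worked example (hypothetical setup: 8 strands of 60 RGB pixels, so
    // numLEDs = 480 and bytesPerPixel = 3): xfer_size = 480 * 3 * 3 = 4320
    // bytes. num_desc = (4320 + 4094) / 4095 = 2 in integer math, and
    // show() later fills those descriptors with 4095 and 225 bytes.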

    if ((allocAddr = (uint8_t *)heap_caps_malloc(
             alloc_size, MALLOC_CAP_DMA | MALLOC_CAP_8BIT))) {

      // Find first 32-bit aligned address following descriptor list
      alignedAddr[0] =
          (uint32_t
               *)((uint32_t)(&allocAddr[num_desc * sizeof(dma_descriptor_t) +
                                        3]) &
                  ~3);
      dmaBuf[0] = (uint8_t *)alignedAddr[0];

      if (dbuf) {
        // Find 32-bit aligned address following first DMA buffer.
        // buf_size is a byte count, so offset from the uint8_t pointer.
        alignedAddr[1] =
            (uint32_t *)((uint32_t)(&dmaBuf[0][buf_size]) & ~3);
      } else {
        alignedAddr[1] = alignedAddr[0];
      }
      dmaBuf[1] = (uint8_t *)alignedAddr[1];

      // LCD_CAM isn't enabled by default -- MUST begin with this:
      periph_module_enable(PERIPH_LCD_CAM_MODULE);
      periph_module_reset(PERIPH_LCD_CAM_MODULE);

      // Reset LCD bus
      LCD_CAM.lcd_user.lcd_reset = 1;
      esp_rom_delay_us(100);

      // Configure LCD clock
      LCD_CAM.lcd_clock.clk_en = 1;             // Enable clock
      LCD_CAM.lcd_clock.lcd_clk_sel = 2;        // PLL240M source
      LCD_CAM.lcd_clock.lcd_clkm_div_a = 1;     // 1/1 fractional divide,
      LCD_CAM.lcd_clock.lcd_clkm_div_b = 1;     // plus '99' below yields...
      LCD_CAM.lcd_clock.lcd_clkm_div_num = 99;  // 1:100 prescale (2.4 MHz CLK)
      LCD_CAM.lcd_clock.lcd_ck_out_edge = 0;    // PCLK low in 1st half cycle
      LCD_CAM.lcd_clock.lcd_ck_idle_edge = 0;   // PCLK low idle
      LCD_CAM.lcd_clock.lcd_clk_equ_sysclk = 1; // PCLK = CLK (ignore CLKCNT_N)

      // Configure frame format
      LCD_CAM.lcd_ctrl.lcd_rgb_mode_en = 0;    // i8080 mode (not RGB)
      LCD_CAM.lcd_rgb_yuv.lcd_conv_bypass = 0; // Disable RGB/YUV converter
      LCD_CAM.lcd_misc.lcd_next_frame_en = 0;  // Do NOT auto-frame
      LCD_CAM.lcd_data_dout_mode.val = 0;      // No data delays
      LCD_CAM.lcd_user.lcd_always_out_en = 1;  // Enable 'always out' mode
      LCD_CAM.lcd_user.lcd_8bits_order = 0;    // Do not swap bytes
      LCD_CAM.lcd_user.lcd_bit_order = 0;      // Do not reverse bit order
      LCD_CAM.lcd_user.lcd_2byte_en = 0;       // 8-bit data mode
      LCD_CAM.lcd_user.lcd_dummy = 1;          // Dummy phase(s) @ LCD start
      LCD_CAM.lcd_user.lcd_dummy_cyclelen = 0; // 1 dummy phase
      LCD_CAM.lcd_user.lcd_cmd = 0;            // No command at LCD start
      // Dummy phase(s) MUST be enabled for DMA to trigger reliably.

      const uint8_t mux[] = {
          LCD_DATA_OUT0_IDX, LCD_DATA_OUT1_IDX, LCD_DATA_OUT2_IDX,
          LCD_DATA_OUT3_IDX, LCD_DATA_OUT4_IDX, LCD_DATA_OUT5_IDX,
          LCD_DATA_OUT6_IDX, LCD_DATA_OUT7_IDX,
      };

      // Route LCD signals to GPIO pins
      for (int i = 0; i < 8; i++) {
        if (pins[i] >= 0) {
          esp_rom_gpio_connect_out_signal(pins[i], mux[i], false, false);
          _np8_set_pin_gpio((gpio_num_t)pins[i]);
          gpio_set_drive_capability((gpio_num_t)pins[i], (gpio_drive_cap_t)3);
          bitmask[i] = 1 << i;
        }
      }

      // Set up DMA descriptor list (length and data are set before xfer)
      desc = (dma_descriptor_t *)allocAddr; // At start of alloc'd buffer
      for (int i = 0; i < num_desc; i++) {
        desc[i].dw0.owner = DMA_DESCRIPTOR_BUFFER_OWNER_DMA;
        desc[i].dw0.suc_eof = 0;
        desc[i].next = &desc[i + 1];
      }
      desc[num_desc - 1].dw0.suc_eof = 1;
      desc[num_desc - 1].next = NULL;

      // Alloc DMA channel & connect it to LCD periph
      gdma_channel_alloc_config_t dma_chan_config = {
          .sibling_chan = NULL,
          .direction = GDMA_CHANNEL_DIRECTION_TX,
          .flags = {.reserve_sibling = 0}};
      gdma_new_channel(&dma_chan_config, &dma_chan);
      gdma_connect(dma_chan, GDMA_MAKE_TRIGGER(GDMA_TRIG_PERIPH_LCD, 0));
      gdma_strategy_config_t strategy_config = {.owner_check = false,
                                                .auto_update_desc = false};
      gdma_apply_strategy(dma_chan, &strategy_config);

      // Enable DMA transfer callback
      gdma_tx_event_callbacks_t tx_cbs = {.on_trans_eof = dma_callback};
      gdma_register_tx_event_callbacks(dma_chan, &tx_cbs, NULL);

      return true; // Success!
    }

#else // SAMD

    // Double-buffered DMA out is currently NOT supported on SAMD.
    // Code's there but it causes weird flickering. All the pointer
    // work looks right, I'm just speculating that this might have
    // something to do with HDR refresh being timer interrupt-driven,
    // that certain elements of the class might need to be declared
    // volatile, which currently causes compilation mayhem.
    // What with the timer interrupt, and needing to share cycles
    // with the main thread of execution, I'm not sure it's helpful
    // on SAMD anyway, mostly an RP2040 thing.
    dbuf = false;

    uint32_t buf_size = numLEDs * bytesPerPixel * 3 + EXTRASTARTBYTES + 3;
    // uint32_t alloc_size = dbuf ? buf_size * 2 : buf_size;

    if ((allocAddr = (uint8_t *)malloc(buf_size))) {
      int i;

      dma.setTrigger(TCC0_DMAC_ID_OVF);
      dma.setAction(DMA_TRIGGER_ACTON_BEAT);

      // Get address of first byte that's on a 32-bit boundary and at least
      // EXTRASTARTBYTES into the allocation. This is where pixel data
      // starts.
      alignedAddr[0] =
          (uint32_t *)((uint32_t)(&allocAddr[EXTRASTARTBYTES + 3]) & ~3);

      // DMA transfer then starts EXTRASTARTBYTES back from this to stabilize
      dmaBuf[0] = (uint8_t *)alignedAddr[0] - EXTRASTARTBYTES;
      memset(dmaBuf[0], 0, EXTRASTARTBYTES); // Initialize start with zeros

      if (dbuf) {
        alignedAddr[1] =
            (uint32_t
                 *)((uint32_t)(&allocAddr[buf_size + EXTRASTARTBYTES + 3]) &
                    ~3);
        dmaBuf[1] = (uint8_t *)alignedAddr[1] - EXTRASTARTBYTES;
        memset(dmaBuf[1], 0, EXTRASTARTBYTES);
      } else {
        alignedAddr[1] = alignedAddr[0];
        dmaBuf[1] = dmaBuf[0];
      }

      uint8_t *dst = &((uint8_t *)(&TCC0->PATT))[1]; // PATT.vec.PGV
      dma.allocate();
      desc = dma.addDescriptor(dmaBuf[dbuf_index], // source
                               dst,                // destination
                               buf_size -
                                   3, // count (don't include alignment bytes!)
                               DMA_BEAT_SIZE_BYTE, // size per
                               true,               // increment source
                               false); // don't increment destination

      dma.setCallback(dmaCallback);

#ifdef __SAMD51__
      // Set up generic clock gen 2 as source for TCC0
      // Datasheet recommends setting GENCTRL register in a single write,
      // so a temp value is used here to more easily construct a value.
      GCLK_GENCTRL_Type genctrl;
      genctrl.bit.SRC = GCLK_GENCTRL_SRC_DFLL_Val; // 48 MHz source
      genctrl.bit.GENEN = 1;                       // Enable
      genctrl.bit.OE = 1;
      genctrl.bit.DIVSEL = 0; // Do not divide clock source
      genctrl.bit.DIV = 0;
      GCLK->GENCTRL[2].reg = genctrl.reg;
      while (GCLK->SYNCBUSY.bit.GENCTRL2 == 1)
        ;

      GCLK->PCHCTRL[TCC0_GCLK_ID].bit.CHEN = 0;
      while (GCLK->PCHCTRL[TCC0_GCLK_ID].bit.CHEN)
        ; // Wait for disable
      GCLK_PCHCTRL_Type pchctrl;
      pchctrl.bit.GEN = GCLK_PCHCTRL_GEN_GCLK2_Val;
      pchctrl.bit.CHEN = 1;
      GCLK->PCHCTRL[TCC0_GCLK_ID].reg = pchctrl.reg;
      while (!GCLK->PCHCTRL[TCC0_GCLK_ID].bit.CHEN)
        ; // Wait for enable
#else
      // Enable GCLK for TCC0
      GCLK->CLKCTRL.reg =
          (uint16_t)(GCLK_CLKCTRL_CLKEN | GCLK_CLKCTRL_GEN_GCLK0 |
                     GCLK_CLKCTRL_ID(GCM_TCC0_TCC1));
      while (GCLK->STATUS.bit.SYNCBUSY == 1)
        ;
#endif

      // Disable TCC before configuring it
      TCC0->CTRLA.bit.ENABLE = 0;
      while (TCC0->SYNCBUSY.bit.ENABLE)
        ;

      TCC0->CTRLA.bit.PRESCALER = TCC_CTRLA_PRESCALER_DIV1_Val; // 1:1 Prescale

      TCC0->WAVE.bit.WAVEGEN = TCC_WAVE_WAVEGEN_NPWM_Val; // Normal PWM mode
      while (TCC0->SYNCBUSY.bit.WAVE)
        ;

      TCC0->CC[0].reg = 0; // No PWM out
      while (TCC0->SYNCBUSY.bit.CC0)
        ;

        // 2.4 MHz clock: 3 DMA xfers per NeoPixel bit = 800 kHz
#ifdef __SAMD51__
      TCC0->PER.reg = ((48000000 + 1200000) / 2400000) - 1;
#else
      TCC0->PER.reg = ((F_CPU + 1200000) / 2400000) - 1;
#endif
      while (TCC0->SYNCBUSY.bit.PER)
        ;
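      // Worked example (assuming a 48 MHz TCC0 clock): the added 1200000
      // is half the divisor, so the integer division rounds to nearest.
      // (48000000 + 1200000) / 2400000 = 20, minus 1 gives PER = 19. The
      // counter then overflows every 20 ticks, or 2.4 million times per
      // second, and each overflow triggers one DMA byte. Three bytes
      // per NeoPixel bit makes 800000 bits per second per output.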

      uint8_t enableMask = 0x00; // Bitmask of pattern gen outputs
      for (i = 0; i < 8; i++) {
        if ((bitmask[i] = configurePin(pins[i]))) // assign AND test!
          enableMask |= bitmask[i];
      }
      TCC0->PATT.vec.PGV = 0; // Set all pattern outputs to 0
      while (TCC0->SYNCBUSY.bit.PATT)
        ;
      TCC0->PATT.vec.PGE = enableMask; // Enable pattern outputs
      while (TCC0->SYNCBUSY.bit.PATT)
        ;

      TCC0->CTRLA.bit.ENABLE = 1;
      while (TCC0->SYNCBUSY.bit.ENABLE)
        ;

      return true; // Success!
    }

#endif // end SAMD

    free(pixels);
    pixels = NULL;
  }

  return false;
}

// Convert NeoPixel buffer to NeoPXL8 output format
void Adafruit_NeoPXL8::stage(void) {

  uint8_t bytesPerLED = (wOffset == rOffset) ? 3 : 4;
  uint32_t pixelsPerRow = numLEDs / 8, bytesPerRow = pixelsPerRow * bytesPerLED,
           i;

#if defined(ARDUINO_ARCH_RP2040)

  memset(dmaBuf[dbuf_index], 0, numLEDs * bytesPerLED);

  for (uint8_t b = 0; b < 8; b++) { // For each output pin 0-7
    uint8_t mask = bitmask[b];
    if (mask) {                                // Enabled?
      uint8_t *src = &pixels[b * bytesPerRow]; // Start of row data
      uint8_t *dst = dmaBuf[dbuf_index];
      for (i = 0; i < bytesPerRow; i++) { // Each byte in row...
        // Brightness scaling doesn't require shift down,
        // we'll just pluck from bits 15-8...
        uint16_t value = *src++ * brightness;
        if (value & 0x8000)
          dst[0] |= mask;
        if (value & 0x4000)
          dst[1] |= mask;
        if (value & 0x2000)
          dst[2] |= mask;
        if (value & 0x1000)
          dst[3] |= mask;
        if (value & 0x0800)
          dst[4] |= mask;
        if (value & 0x0400)
          dst[5] |= mask;
        if (value & 0x0200)
          dst[6] |= mask;
        if (value & 0x0100)
          dst[7] |= mask;
        dst += 8;
      }
    }
  }

#else // SAMD or ESP32S3

  static const uint8_t dmaFill[] __attribute__((__aligned__(4))) = {
      0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00,
      0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00};

  // Clear DMA buffer data (32-bit writes are used to save a few cycles)
  uint32_t *in = (uint32_t *)dmaFill, *out = alignedAddr[dbuf_index];
  for (i = 0; i < bytesPerRow; i++) {
    *out++ = in[0];
    *out++ = in[1];
    *out++ = in[2];
    *out++ = in[3];
    *out++ = in[4];
    *out++ = in[5];
  }

  for (uint8_t b = 0; b < 8; b++) { // For each output pin 0-7
    uint8_t mask = bitmask[b];
    if (mask) {                                // Enabled?
      uint8_t *src = &pixels[b * bytesPerRow]; // Start of row data
      uint8_t *dst = &((uint8_t *)alignedAddr[dbuf_index])[1];
      for (i = 0; i < bytesPerRow; i++) { // Each byte in row...
        // Brightness scaling doesn't require shift down,
        // we'll just pluck from bits 15-8...
        uint16_t value = *src++ * brightness;
        if (value & 0x8000)
          dst[0] |= mask;
        if (value & 0x4000)
          dst[3] |= mask;
        if (value & 0x2000)
          dst[6] |= mask;
        if (value & 0x1000)
          dst[9] |= mask;
        if (value & 0x0800)
          dst[12] |= mask;
        if (value & 0x0400)
          dst[15] |= mask;
        if (value & 0x0200)
          dst[18] |= mask;
        if (value & 0x0100)
          dst[21] |= mask;
        dst += 24;
      }
    }
  }

#endif // end SAMD/ESP32S3

  staged = true;
}
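
// Worked example of the SAMD/ESP32S3 encoding above: every NeoPixel bit
// occupies three DMA bytes. The first is 0xFF (all outputs high), the
// second carries the actual data bit for each output, and the third is
// 0x00 (all outputs low), giving the 1:3 bit timing. Suppose only the
// output whose mask is 0x04 has a set bit at some position: the three
// bytes for that position are 0xFF, 0x04, 0x00. That output stays high
// for two of the three byte periods (a '1'), the others for one (a '0').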

void Adafruit_NeoPXL8::show(void) {
  if (dmaBuf[0] == dmaBuf[1]) {
    // Single-buffered operation. Must wait for current DMA transfer to
    // complete before staging new data in the buffer, or it may get
    // corrupted in mid-transfer.
#if defined(CONFIG_IDF_TARGET_ESP32S3)
    while (LCD_CAM.lcd_user.lcd_start)
      ; // Wait for DMA IRQ
    lastBitTime = micros();
#else
    while (sending)
      ; // Wait for DMA IRQ
#endif
    if (!staged)
      stage(); // Convert data
  } else {
    // Double-buffered operation, new data can be staged in alternating
    // buffer while the current DMA transfer is in-progress.
    if (!staged)
      stage(); // Convert data
      // Still have to wait for DMA to finish before latch check though.
#if defined(CONFIG_IDF_TARGET_ESP32S3)
    while (LCD_CAM.lcd_user.lcd_start)
      ; // Wait for DMA IRQ
    lastBitTime = micros();
#else
    while (sending)
      ; // Wait for DMA IRQ
#endif
  }
  staged = false;
  sending = 1;

#if defined(ARDUINO_ARCH_RP2040)

  // Reset DMA source address for next transfer.
  // Not sure what's up here, but if we don't delay a moment before
  // changing the DMA read address, the last byte out is corrupted.
  // It's possible the DMA callback gets invoked at the start of the
  // last byte out, rather than end, or might have to do with the
  // PIO FIFOs or something. Regardless, 10 uS fixes it.
  if (dmaBuf[0] != dmaBuf[1])
    delayMicroseconds(10);
  dma_channel_set_read_addr(dma_channel, dmaBuf[dbuf_index], false);

  pio_sm_clear_fifos(pio, sm);                  // Clear TX FIFO just in case
  while ((micros() - lastBitTime) <= latchtime) // Wait for latch
    ;
  dma_channel_start(dma_channel); // Start new transfer

#elif defined(CONFIG_IDF_TARGET_ESP32S3)

  gdma_reset(dma_chan);
  LCD_CAM.lcd_user.lcd_dout = 1;
  LCD_CAM.lcd_user.lcd_update = 1;
  LCD_CAM.lcd_misc.lcd_afifo_reset = 1;

  uint8_t bytesPerPixel = (wOffset == rOffset) ? 3 : 4;
  uint32_t xfer_size = numLEDs * bytesPerPixel * 3;
  int num_desc = (xfer_size + 4094) / 4095; // sic. (NOT 4096)

  int bytesToGo = xfer_size;
  int offset = 0;
  for (int i = 0; i < num_desc; i++) {
    int bytesThisPass = bytesToGo;
    if (bytesThisPass > 4095)
      bytesThisPass = 4095;
    desc[i].dw0.size = desc[i].dw0.length = bytesThisPass;
    desc[i].buffer = &dmaBuf[dbuf_index][offset];
    bytesToGo -= bytesThisPass;
    offset += bytesThisPass;
  }

  while ((micros() - lastBitTime) <= latchtime) // Wait for latch
    ;

  gdma_start(dma_chan, (intptr_t)&desc[0]);
  esp_rom_delay_us(1);
  LCD_CAM.lcd_user.lcd_start = 1; // Begin LCD DMA xfer

#else // SAMD

  // Reset DMA source address for next transfer
  dma.changeDescriptor(desc, dmaBuf[dbuf_index], NULL, 0);

  dma.startJob();
  // Wait for latch, factor in EXTRASTARTBYTES transmission time too!
  while ((micros() - lastBitTime) <=
         ((uint32_t)latchtime - (EXTRASTARTBYTES * 5 / 4)))
    ;
  dma.trigger(); // Start new transfer

#endif // end SAMD

  dbuf_index ^= 1; // Swap buffer index for next staging pass
}
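
/*
A note on the ESP32-S3 descriptor loop in show(): the transfer is cut into
pieces of at most 4095 bytes, one DMA descriptor per piece. Below is a
minimal host-side sketch of just that arithmetic, with no hardware access.
Chunk and splitTransfer() are hypothetical names for illustration and are
not part of this library.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One piece of a larger transfer: where it starts and how long it is.
struct Chunk {
  uint32_t offset;
  uint32_t length;
};

// Split totalBytes into consecutive pieces of at most maxChunk bytes.
std::vector<Chunk> splitTransfer(uint32_t totalBytes,
                                 uint32_t maxChunk = 4095) {
  assert(maxChunk > 0);
  std::vector<Chunk> chunks;
  for (uint32_t offset = 0; offset < totalBytes;) {
    uint32_t len = totalBytes - offset; // Bytes still to go
    if (len > maxChunk)
      len = maxChunk;
    chunks.push_back({offset, len});
    offset += len;
  }
  return chunks;
}
```

For example, 9000 bytes split into pieces of 4095, 4095 and 810 bytes.
*/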

// Returns true if DMA transfer is NOT presently occurring.
// We MAY (or not) be in the EOD latch time, or might be idle.
// Either way, it's now safe to stage data from NeoPixel to DMA buffer.
// This might be helpful for code that wants more precise and uniform
// animation timing...it might be using a timer interrupt or micros()
// delta for frame-to-frame intervals...after calculating the next frame,
// one can 'stage' the data (convert it from NeoPixel buffer format to
// DMA parallel output format) once the current frame has finished
// transmitting, rather than being done at the beginning of the show()
// function (the staging conversion isn't entirely deterministic).
bool Adafruit_NeoPXL8::canStage(void) const {
  // If double-buffering enabled, can always stage
  return (dmaBuf[0] != dmaBuf[1]) || !sending;
}

// Returns true if DMA transfer is NOT presently occurring and
// NeoPixel EOD latch has fully transpired; library is idle.
bool Adafruit_NeoPXL8::canShow(void) const {
  return !sending && ((micros() - lastBitTime) > latchtime);
}
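
/*
The latch checks in this file compare (micros() - lastBitTime) against
latchtime using unsigned 32-bit math, which keeps working when micros()
rolls over. A small sketch of that comparison in isolation follows;
latchElapsed() is a hypothetical helper, not a library function.

```cpp
#include <cstdint>

// True once more than latchtime microseconds have passed since
// lastBitTime. Unsigned subtraction wraps modulo 2^32, so the difference
// is still the elapsed time after 'now' rolls over past zero (as long as
// fewer than 2^32 microseconds have really passed).
constexpr bool latchElapsed(uint32_t now, uint32_t lastBitTime,
                            uint32_t latchtime) {
  return (uint32_t)(now - lastBitTime) > latchtime;
}

static_assert(latchElapsed(5, 0xFFFFFFF0u, 10), "21 us across rollover");
static_assert(!latchElapsed(5, 0xFFFFFFFEu, 10), "only 7 us elapsed");
```
*/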

// NEOPXL8HDR CLASS --------------------------------------------------------

Adafruit_NeoPXL8HDR::Adafruit_NeoPXL8HDR(uint16_t n, int8_t *p, neoPixelType t)
    : Adafruit_NeoPXL8(n, p, t) {}

Adafruit_NeoPXL8HDR::~Adafruit_NeoPXL8HDR() {
  if (dither_table)
    free(dither_table);
  if (pixel_buf[0])
    free(pixel_buf[0]);
}

bool Adafruit_NeoPXL8HDR::begin(bool blend, uint8_t bits, bool dbuf) {
  // If blend flag is set, allocate 3X pixel buffers, else 2X (for
  // temporal dithering only). Result is the buffer size in 16-bit
  // words (not bytes).
  uint32_t buf_size = numBytes * (blend ? 3 : 2);

  dither_bits = (bits > 8) ? 8 : bits;

  if ((pixel_buf[0] = (uint16_t *)malloc(buf_size * sizeof(uint16_t)))) {
    if ((dither_table =
             (uint16_t *)malloc((1 << dither_bits) * sizeof(uint16_t)))) {
      if (Adafruit_NeoPXL8::begin(dbuf)) {
#if defined(ARDUINO_ARCH_RP2040)
        mutex_init(&mutex);
#elif defined(CONFIG_IDF_TARGET_ESP32S3)
        mutex = xSemaphoreCreateMutex();
#endif // end ESP32S3/RP2040

        // All allocations & initializations were successful.
        // Generate bit-flip table for ordered dithering...
        for (int i = 0; i < (1 << dither_bits); i++) {
          uint16_t result = 0;
          for (uint8_t bit = 0; bit < dither_bits; bit++) {
            result = (result << 1) | ((i >> bit) & 1);
          }
          dither_table[i] = result << (16 - dither_bits);
        }
        setBrightness(65535, 1.0); // Sets up gamma LUT (max bright, linear)
        memset(pixel_buf[0], 0, buf_size * sizeof(uint16_t));
        if (blend) {
          // 3 pixel buffers (2 for blending & dithering, plus original)
          pixel_buf[1] = &pixel_buf[0][numBytes];
        } else {
          // 2 pixel buffers (1 for dithering, 1 for original), but first 2
          // indices of pixel_buf point to the same buffer so we can process
          // it the same as when blending. Then index 2 always points to the
          // original, whether blending or not.
          pixel_buf[1] = pixel_buf[0];
        }
        // Buf index 2 is the "original" pixel data that setPixelColor()
        // acts on. It's maintained as a separate copy because there may be
        // multiple calls to refresh() to handle dithering & blending while
        // a new frame is being rendered, and we don't want interim results
        // to "tear" the image.
        pixel_buf[2] = &pixel_buf[1][numBytes];
        return true; // Good to go!
      }
      // If NeoPXL8::begin() failed, free any interim allocations.
      free(dither_table);
      dither_table = NULL;
    }
    free(pixel_buf[0]);
    pixel_buf[0] = NULL;
  }
  return false;
}
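
/*
The bit-flip table built in begin() can be tried on its own. The sketch
below repeats the same bit-reversal loop on a host compiler;
makeDitherTable() is a hypothetical name, and the 0-8 bit limit mirrors
the clamp applied to dither_bits above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Build a table of 2^bits thresholds in bit-reversed order, each
// left-justified into 16 bits.
std::vector<uint16_t> makeDitherTable(uint8_t bits) {
  assert(bits <= 8);
  std::vector<uint16_t> table(1u << bits);
  for (uint32_t i = 0; i < table.size(); i++) {
    uint16_t reversed = 0;
    for (uint8_t bit = 0; bit < bits; bit++) {
      reversed = (uint16_t)((reversed << 1) | ((i >> bit) & 1));
    }
    table[i] = (uint16_t)(reversed << (16 - bits));
  }
  return table;
}
```

With bits = 2 the table is 0x0000, 0x8000, 0x4000, 0xC000, so successive
passes alternate between low and high thresholds instead of ramping.
*/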

void Adafruit_NeoPXL8HDR::setBrightness(uint8_t b) {
  // Set RGBW, keep existing gamma
  uint16_t b16 = b * 257; // 257 (not 256) is intentional; see setPixelColor()
  setBrightness(b16, b16, b16, b16, gfactor);
}

void Adafruit_NeoPXL8HDR::setBrightness(uint16_t b, float y) {
  // Set RGBW + gamma
  setBrightness(b, b, b, b, y);
}

void Adafruit_NeoPXL8HDR::setBrightness(uint16_t r, uint16_t g, uint16_t b) {
  // Set RGB, keep existing W + gamma
  setBrightness(r, g, b, brightness_rgbw[3], gfactor);
}

void Adafruit_NeoPXL8HDR::setBrightness(uint16_t r, uint16_t g, uint16_t b,
                                        uint16_t w) {
  // Set RGBW, keep existing gamma
  setBrightness(r, g, b, w, gfactor);
}

void Adafruit_NeoPXL8HDR::setBrightness(uint16_t r, uint16_t g, uint16_t b,
                                        float y) {
  // Set RGB+gamma, keep existing W
  setBrightness(r, g, b, brightness_rgbw[3], y);
}

void Adafruit_NeoPXL8HDR::setBrightness(uint16_t r, uint16_t g, uint16_t b,
                                        uint16_t w, float y) {
  // Set RGBW+gamma, recalc table
  brightness_rgbw[0] = r;
  brightness_rgbw[1] = g;
  brightness_rgbw[2] = b;
  brightness_rgbw[3] = w;
  gfactor = y;
  calc_gamma_table();
}

void Adafruit_NeoPXL8HDR::calc_gamma_table(void) {
  for (uint8_t c = 0; c < 4; c++) { // R, G, B, W component
    // It's normal and intentional here that the peak value is scaled
    // down very slightly. Each lookup table entry represents both a base
    // 8-bit brightness level (0-255) and an 8-bit probability of "dithering
    // up" to the next level. Since there's nowhere "above" 255 to dither
    // (else it would roll over), at maximum brightness the topmost entry
    // should be 0xFF00. We could either clip the top of the range or scale
    // throughout. Since a gamma curve is also likely being applied anyway,
    // this code opts for scale. This results in up to 65281 (not 65536)
    // possible levels at full brightness. Since dithering is usually well
    // under 8 bits, some of this gets truncated on output anyway, all good.
    // A tiny bit of linearity is snuck in so we don't have a bunch of 0
    // elements at the bottom.
    float top = (float)(brightness_rgbw[c] * 0xFF00UL / 0xFFFF);
    // There are only 256 elements in the gamma table, as a full 16-bit table
    // would be inordinately large. In-between values are interpolated.
    for (int i = 0; i < 256; i++) {
      g16[c][i] =
          i + uint16_t(pow((float)i / 255.0, gfactor) * (top - i) + 0.5);
    }
  }
}

void Adafruit_NeoPXL8HDR::show(void) {
  // Called from the main thread of execution. New pixel data (via
  // setPixelColor()) is loaded, but no blend/dither/refresh cycle occurs --
  // that must be done with separate calls to refresh(). Originally had this
  // fall through to the blend/dither code, but syncing the two threads both
  // vying for dither access got ugly fast. Simpler as distinct behaviors.
#if defined(ARDUINO_ARCH_RP2040)
  mutex_enter_blocking(&mutex); // Sync w/refresh() on other core
#elif defined(CONFIG_IDF_TARGET_ESP32S3)
  xSemaphoreTake(mutex, 100);
#else
  noInterrupts();
#endif
  memcpy(pixel_buf[stage_index], pixel_buf[2], numBytes * sizeof(uint16_t));
  if (pixel_buf[0] != pixel_buf[1]) { // Blending enabled?
    stage_index ^= 1;                 // Ping-pong the staging buffers
  }
#if defined(ARDUINO_ARCH_RP2040)
  mutex_exit(&mutex); // refresh() can resume
#elif defined(CONFIG_IDF_TARGET_ESP32S3)
  xSemaphoreGive(mutex);
#else
  interrupts();
#endif
  new_pixels = true; // Next true pass, don't blend new data, show at 100%
}

// 32-bit math requires some tradeoff between the accuracy of frame blending
// and the maximum blend period that can be supported. BSHIFT determines
// these limits. A value of 4 allows up to ~1 sec max blend time with about
// 4K distinct blend states possible, while 6 allows up to ~4 sec / 1K,
// still more than enough (temporal dithering is usually coarser than this).
#define BSHIFT 6 ///< Bit-shift in fixed-point math
#define BLEND_MAX_USEC ((0xFFFFFFFF / 0xFF01) << BSHIFT) ///< Resulting max

// Called from a second core or a timer interrupt. Blending and dithering
// occur, but no new pixel data is loaded, just iterating.
void Adafruit_NeoPXL8HDR::refresh(void) {

  if (pixel_buf[2]) { // Don't allow refresh until begin() is finished

    uint32_t now = micros();
    uint32_t elapsed = now - last_show_time;
    // Need to limit this to avoid 32-bit overflow later
    if (elapsed > BLEND_MAX_USEC)
      elapsed = BLEND_MAX_USEC;
    if (new_pixels) {
      new_pixels = false;
      avg_show_interval = ((avg_show_interval * 7) + elapsed + 4) / 8;
      last_show_time = now;
      elapsed = 0;
    }
#if defined(ARDUINO_ARCH_RP2040)
    mutex_enter_blocking(&mutex); // Wait on show() on other thread
#elif defined(CONFIG_IDF_TARGET_ESP32S3)
    xSemaphoreTake(mutex, 100);
#endif
    uint16_t *p1 = pixel_buf[stage_index];     // Prev pixels
    uint16_t *p2 = pixel_buf[1 - stage_index]; // Next pixels
#if defined(ARDUINO_ARCH_RP2040)
    mutex_exit(&mutex); // Thx, back to you...
#elif defined(CONFIG_IDF_TARGET_ESP32S3)
    xSemaphoreGive(mutex);
#endif

    // Blend and/or dither from p1 & p2 into pixels[]

    uint16_t weight1, weight2;            // Current/next pixel blend weights
    if (pixel_buf[0] != pixel_buf[1]) {   // Temporal blending?
      if (elapsed >= avg_show_interval) { // At or past end of blend
        weight2 = 0xFF01;                 // Next pixels contribute 100%
      } else {                            // Start or part way through blend
        weight2 = 0xFF01 * (elapsed >> BSHIFT) / (avg_show_interval >> BSHIFT);
        // Note to Future Self: keep this fixed-point, don't float it!
      }
    } else {
      weight2 = 0;
    }
    weight1 = 0xFF01 - weight2;
    // Sum of weight1+2 is always 65281 (0xFF01, *not* 0xFFFF or 0xFF00), on
    // purpose and by design. Blend of 16-bit pixel values by these weights
    // yields a 32-bit result (max 0xFF0000FF) that, shifted right 16 bits,
    // has a max of 0xFF00. Gamma table has 256 entries, so 16-bit colors
    // interpolate between positions -- the lower table index being the
    // upper byte of the blended result (up to 255, last index in table),
    // and the next table entry weighted by the lower byte of the blended
    // result. This way, the maximum blended pixel brightness (65535) uses
    // gamma entry 255, with no weight to the nonexistent subsequent element.
    // It maxes out the available range and avoids fencepost errors. This
    // means we actually get a bit fewer than 65536 colors, but since there's
    // coarser temporal dithering going on, these tiny differences get
    // quantized away anyway, no great loss.

    uint16_t d = dither_table[dither_index];
    uint16_t dither_mask = (uint16_t)((1 << dither_bits) - 1)
                           << (16 - dither_bits);
    uint8_t *p; // NeoPixel dest buf
    uint8_t idx, w2;
    uint32_t c; // R/G/B/W component

    if (wOffset == rOffset) { // Is an RGB-type strip, 3 bytes/pixel
      for (uint32_t i = 0; i < numBytes; i += 3) {
        p = &pixels[i]; // -> NeoPixel lib buffer (8-bit)

        // Blend values between p1 & p2 buffers (if blending is disabled,
        // p1 & p2 both point to the same data, so we don't need separate
        // code for blended vs not).

        c = *p1++ * weight1 + *p2++ * weight2; // 32-bit result
        // Determine base index into gamma table (high byte of 32-bit
        // result), and weighting of next gamma entry.
        idx = c >> 24; // High byte = base gamma table index
        w2 = c >> 16;  // Mid-byte = next-entry weight
        c = g16[0][idx] * (256 - w2) + g16[0][idx + 1] * w2;
        p[rOffset] = (c >> 16) + ((c & dither_mask) > d);
        // w2 (and its implied inverse) are gamma table weights. Their sum is
        // always 256, but w2 only goes up to 255, again on purpose and by
        // design. The weight of the second entry should be at most 255/256 --
        // if it were 256/256, we'd just +1 the base index and use 0 for w2.

        // Same operation, green channel
        c = *p1++ * weight1 + *p2++ * weight2;
        idx = c >> 24;
        w2 = c >> 16;
        c = g16[1][idx] * (256 - w2) + g16[1][idx + 1] * w2;
        p[gOffset] = (c >> 16) + ((c & dither_mask) > d);

        // Same operation, blue channel
        c = *p1++ * weight1 + *p2++ * weight2;
        idx = c >> 24;
        w2 = c >> 16;
        c = g16[2][idx] * (256 - w2) + g16[2][idx + 1] * w2;
        p[bOffset] = (c >> 16) + ((c & dither_mask) > d);
      }
    } else { // Is an RGBW-type strip, 4 bytes/pixel
      for (uint32_t i = 0; i < numBytes; i += 4) {
        // Same as above, with added W channel
        p = &pixels[i]; // -> NeoPixel lib buffer (8-bit)

        c = *p1++ * weight1 + *p2++ * weight2;
        idx = c >> 24;
        w2 = c >> 16;
        c = g16[0][idx] * (256 - w2) + g16[0][idx + 1] * w2;
        p[rOffset] = (c >> 16) + ((c & dither_mask) > d);

        c = *p1++ * weight1 + *p2++ * weight2;
        idx = c >> 24;
        w2 = c >> 16;
        c = g16[1][idx] * (256 - w2) + g16[1][idx + 1] * w2;
        p[gOffset] = (c >> 16) + ((c & dither_mask) > d);

        c = *p1++ * weight1 + *p2++ * weight2;
        idx = c >> 24;
        w2 = c >> 16;
        c = g16[2][idx] * (256 - w2) + g16[2][idx + 1] * w2;
        p[bOffset] = (c >> 16) + ((c & dither_mask) > d);

        c = *p1++ * weight1 + *p2++ * weight2;
        idx = c >> 24;
        w2 = c >> 16;
        c = g16[3][idx] * (256 - w2) + g16[3][idx + 1] * w2;
        p[wOffset] = (c >> 16) + ((c & dither_mask) > d);
      }
    }

    Adafruit_NeoPXL8::show();

    // Cycle dither probability. When it rolls over, update FPS estimate.
    if (++dither_index >= (1 << dither_bits)) {
      dither_index = 0;
      elapsed = now - last_fps_time; // Microseconds since last dither rollover
      if (elapsed)                   // Avoid /0 just in case
        fps = ((fps * 7) + ((1000000UL << dither_bits) / elapsed) + 4) / 8;
      last_fps_time = now;
    }

  } // end if (pixel_buf[2])
}
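
/*
The weight arithmetic in refresh() is easy to check off-device. The sketch
below reproduces only the blend step and the split into gamma table index
and next-entry weight. blend16(), GammaPos and gammaPosition() are
hypothetical names used for illustration, not part of this library.

```cpp
#include <cassert>
#include <cstdint>

// Blend two 16-bit values. The two weights always sum to 0xFF01, so the
// largest possible result is 0xFFFF * 0xFF01 = 0xFF0000FF.
uint32_t blend16(uint16_t prev, uint16_t next, uint16_t weightNext) {
  assert(weightNext <= 0xFF01);
  uint32_t weightPrev = 0xFF01u - weightNext;
  return (uint32_t)prev * weightPrev + (uint32_t)next * weightNext;
}

// Where a blended value falls in a 256-entry gamma table.
struct GammaPos {
  uint8_t index;  // Base table entry (high byte of blended value)
  uint8_t weight; // Weight of the following entry, 0-255 out of 256
};

GammaPos gammaPosition(uint32_t blended) {
  return {(uint8_t)(blended >> 24), (uint8_t)(blended >> 16)};
}
```

A full-brightness input (0xFFFF) lands on index 255 with weight 0, so the
entry past the end of the table never contributes to the result.
*/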

// SOME VALUABLE NOTES ABOUT setPixelColor() AND getPixelColor() FUNCTIONS:
// - These are provided for compatibility with existing NeoPixel or NeoPXL8
//   sketches moved directly to NeoPXL8HDR. New code may prefer set16()
//   and get16() instead, which use 16-bit components.
// - The multiplication by 257 (not 256) in these functions is INTENTIONAL.
//   This is correct for expanding an 8-bit value to 16-bit while fully
//   saturating the numeric range (e.g. 0xFF becomes 0xFFFF, not 0xFF00).
//   Any NeoPXL8-capable MCU will have single-cycle multiply, there is no
//   need to "optimize" this down to 256 or a shift operation. None.
// - pixel_buf pixels are ALWAYS in RGB (or RGBW) order, which simplifies
//   the store operations to fixed offsets (channel reordering happens
//   during the dither operation).
// - Although each of these COULD just multiply r/g/b/w by 257 and call
//   set16(), instead each does the full expand-and-store on its own.
//   These functions will likely be called a LOT from old carry-over
//   NeoPixel projects, so there's some benefit in optimizing out the added
//   function call, and the functions really aren't that large.

void Adafruit_NeoPXL8HDR::setPixelColor(uint16_t n, uint8_t r, uint8_t g,
                                        uint8_t b) {
  if (n < numLEDs) {
    uint16_t *p;
    if (wOffset == rOffset) {   // RGB strip
      p = &pixel_buf[2][n * 3]; // 3 words/pixel
    } else {                    // RGBW strip
      p = &pixel_buf[2][n * 4]; // 4 words/pixel
      p[3] = 0;                 // But only R,G,B passed -- set W to 0
    }
    p[0] = r * 257; // Yes, 257, see notes above
    p[1] = g * 257;
    p[2] = b * 257;
  }
}

void Adafruit_NeoPXL8HDR::setPixelColor(uint16_t n, uint8_t r, uint8_t g,
                                        uint8_t b, uint8_t w) {
  if (n < numLEDs) {
    uint16_t *p;
    if (wOffset == rOffset) {   // RGB strip
      p = &pixel_buf[2][n * 3]; // 3 words/pixel (ignore W)
    } else {                    // RGBW strip
      p = &pixel_buf[2][n * 4]; // 4 words/pixel
      p[3] = w * 257;           // Store W
    }
    p[0] = r * 257; // Yes, 257, see notes above
    p[1] = g * 257;
    p[2] = b * 257;
  }
}

void Adafruit_NeoPXL8HDR::setPixelColor(uint16_t n, uint32_t c) {
  if (n < numLEDs) {
    uint8_t r = (uint8_t)(c >> 16), g = (uint8_t)(c >> 8), b = (uint8_t)c;
    uint16_t *p;
    if (wOffset == rOffset) {         // RGB strip
      p = &pixel_buf[2][n * 3];       // 3 words/pixel
    } else {                          // RGBW strip
      p = &pixel_buf[2][n * 4];       // 4 words/pixel
      uint8_t w = (uint8_t)(c >> 24); // Extract and
      p[3] = w * 257;                 // store W
    }
    p[0] = r * 257; // Yes, 257, see notes above
    p[1] = g * 257;
    p[2] = b * 257;
  }
}

void Adafruit_NeoPXL8HDR::set16(uint16_t n, uint16_t r, uint16_t g, uint16_t b,
                                uint16_t w) {
  if (n < numLEDs) {
    uint16_t *p;
    if (wOffset == rOffset) {   // RGB strip
      p = &pixel_buf[2][n * 3]; // 3 words/pixel
    } else {                    // RGBW strip
      p = &pixel_buf[2][n * 4]; // 4 words/pixel
      p[3] = w;
    }
    p[0] = r; // Internal representation is always RGBW,
    p[1] = g; // no need for ordering (happens on output)
    p[2] = b;
  }
}

// The 8-bit shifts in this function are INTENTIONAL, not dividing by 257
// to reverse the setPixelColor() multiplication. Quantization down is a
// different principle, and the result will be the same as any 8-bit values
// passed to the set functions. Also, shift is single-cycle, but division
// is not.
uint32_t Adafruit_NeoPXL8HDR::getPixelColor(uint16_t n) const {
  if (n < numLEDs) {
    uint16_t *p;
    if (wOffset == rOffset) { // RGB strip
      p = &pixel_buf[2][n * 3];
      return ((uint32_t)(p[0] & 0xFF00) << 8) | (uint32_t)(p[1] & 0xFF00) |
             ((uint32_t)(p[2] & 0xFF00) >> 8);
    } else { // RGBW strip
      p = &pixel_buf[2][n * 4];
      return ((uint32_t)(p[0] & 0xFF00) << 8) | (uint32_t)(p[1] & 0xFF00) |
             ((uint32_t)(p[2] & 0xFF00) >> 8) |
             ((uint32_t)(p[3] & 0xFF00) << 16);
    }
  }
  return 0; // Index out of range, return no color
}
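
/*
The multiply-by-257 on the way in and the 8-bit shift on the way out are
exact inverses for 8-bit inputs. A minimal sketch that checks every value
follows; expand8to16(), reduce16to8() and roundTripIsExact() are
hypothetical helper names, not library functions.

```cpp
#include <cstdint>

// 8-bit to 16-bit: v * 257 copies the byte into both halves (0xAB becomes
// 0xABAB), so 0xFF maps to 0xFFFF rather than 0xFF00.
constexpr uint16_t expand8to16(uint8_t v) { return (uint16_t)(v * 257); }

// 16-bit to 8-bit: keep the high byte.
constexpr uint8_t reduce16to8(uint16_t v) { return (uint8_t)(v >> 8); }

// True if every 8-bit value survives expand-then-reduce unchanged.
bool roundTripIsExact() {
  for (int v = 0; v < 256; v++) {
    if (reduce16to8(expand8to16((uint8_t)v)) != v)
      return false;
  }
  return true;
}

static_assert(expand8to16(0xFF) == 0xFFFF, "full range is reached");
```
*/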

void Adafruit_NeoPXL8HDR::get16(uint16_t n, uint16_t *r, uint16_t *g,
                                uint16_t *b, uint16_t *w) const {
  if (n < numLEDs) {
    uint16_t *p;
    if (wOffset == rOffset) {   // RGB strip
      p = &pixel_buf[2][n * 3]; // 3 words/pixel
      if (w)
        *w = 0;                 // If w passed, clear it
    } else {                    // RGBW strip
      p = &pixel_buf[2][n * 4]; // 4 words/pixel
      if (w)
        *w = p[3]; // Store w
    }
    *r = p[0]; // Internal representation is always RGBW,
    *g = p[1]; // no need for ordering (happens on output)
    *b = p[2];
  } else { // Index out of bounds, return no color
    *r = *g = *b = 0;
    if (w)
      *w = 0;
  }
}

/*--------------------------------------------------------------------------
Some notes on How It Works (and doesn't work):

SAMD21 DMA has no path to the GPIO PORT registers. Instead, one of the
DMA-capable peripherals is exploited for byte-wide concurrent output
(specifically the TCC0 pattern generator, which is normally used for
motor control or some such). Although SAMD51 does have PORT DMA, the
pattern generator approach is used there regardless, so similar code
can be used for both chips. On RP2040 and RP235x, PIO code is used.

To issue 8 bits in parallel, all bytes of NeoPixel data must be "turned
sideways" in RAM so all the bit 7's are issued concurrently, then all
the bit 6's, bit 5's and so forth. Not a problem, and in fact we use this
opportunity to remap pins to strips (e.g. any of the 8 pins can be the
"first" of the pixels in RAM, and so forth, so you can do routing/wiring
however's easiest). On SAMD, a timer/counter is used to issue the data at
a measured rate as required of NeoPixels.
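
The "turned sideways" step can be pictured as an 8x8 bit transpose. The
sketch below is a simplified host-side illustration only: it assumes
output bit s maps straight to strand s, ignores the pin remapping and the
high/low framing bytes, and turnSideways() is a hypothetical name rather
than the library's own staging code.

```cpp
#include <array>
#include <cstdint>

// in[s] is the next data byte for strand s (0-7). out[0] collects every
// strand's bit 7, out[1] every strand's bit 6, and so on down to bit 0.
// Within each output byte, bit s belongs to strand s.
std::array<uint8_t, 8> turnSideways(const std::array<uint8_t, 8> &in) {
  std::array<uint8_t, 8> out{};
  for (int bit = 7; bit >= 0; bit--) {
    uint8_t column = 0;
    for (int strand = 0; strand < 8; strand++) {
      if (in[strand] & (1u << bit))
        column |= (uint8_t)(1u << strand);
    }
    out[7 - bit] = column;
  }
  return out;
}
```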

The bad news for SAMD is that the high and low states at the start/end of
each NeoPixel bit are also issued this way and need to be part of the DMA
output buffer, and this incurs a hefty RAM footprint, about 4X the space
required for the "normal" NeoPixel library (1X for the regular NeoPixel
buffer which we still use, plus another 3X for the DMA expansion) -- so
each RGB pixel needs about 12 bytes RAM, or 16 bytes for RGBW pixels.
The SAMD21 and '51 have gobs of RAM so we can kind of get away with this
(over 2,000 RGB pixels across eight 250-pixel strands), though bloaty and
not optimal. Also, a uniform 1:3 timing is used for the high/data/low
states, which doesn't precisely match the NeoPixel datasheet. It's close
enough in most cases, but I have seen very occasional glitches on the
first pixel of each strand (but this might just be logic levels, I'm
testing without a shifter). I'd recommend experimenting with the library
a bit on a small scale before committing to any large hardware investment
around it.
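
As a rough planning aid, the 4X figure above can be turned into a quick
estimate. This is only a sketch of the arithmetic described here;
approxSamdRam() is a hypothetical name, and it ignores small fixed
overheads and any second DMA buffer used for double-buffering.

```cpp
#include <cstdint>

// Approximate RAM in bytes for a total pixel count on SAMD: 1X for the
// regular NeoPixel buffer plus 3X for the DMA expansion.
constexpr uint32_t approxSamdRam(uint32_t totalPixels,
                                 uint32_t bytesPerPixel) {
  return totalPixels * bytesPerPixel * 4;
}

static_assert(approxSamdRam(1, 3) == 12, "RGB: about 12 bytes per pixel");
static_assert(approxSamdRam(1, 4) == 16, "RGBW: about 16 bytes per pixel");
```

Eight 250-pixel RGB strands (2,000 pixels) come to roughly 24,000 bytes by
this estimate.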

Again, for SAMD, in *theory* it should be possible to get better timing and
reduce the RAM requirements by using 3 DMA channels -- one to issue the
initial 'high' logic level, one for the bit states, and one for the 'low'
level at the end of each bit (the former and latter can be "reused" each
time, not requiring a copy for every byte out), triggered by the timer
overflow and two counter-compare matches -- and in fact if you look at Paul
Stoffregen's OctoWS2811 library (which handles a similar task on different
hardware) you'll see that's exactly what's being done there. Unfortunately
and for whatever reason, it looks as though the "round-robin" DMA
arbitration on SAMD doesn't work in combination with timer triggers. I can
get a single DMA channel triggered off a timer (as currently done in this
code), or multiple channels going round-robin at full tilt, ignoring the
timer. This means either A) tough beans, that's just how it works on this
device, or B) I'm doing something incredibly wrong, and despite trying just
about everything can't figure out round-robin arbitration on a beat timer.
If anyone can offer insights there, or point to a SAMD21-compatible example,
I'd be immensely grateful, as it'd reduce the library's RAM requirements by
a factor of 2 and we could handle even MOAR pixels.
----------------------------------------------------------------------------*/
