How to Clean Crypto Datasets Before Filing: Tax Data Engineering Guide 2025
Professional guide to clean, debug, and prepare your cryptocurrency data before filing your tax return. Learn data engineering techniques applied to crypto taxation so you can obtain accurate reports for the Spanish Tax Agency (AEAT).
Cleriontax Team
Crypto Tax and Data Analysis Experts

The difference between a correct cryptocurrency tax return and one that causes problems with the Tax Authority does not lie only in knowing the regulations. In practice, the real challenge is the quality of the data you use to calculate your capital gains and losses. After exporting your history from different exchanges and wallets, you end up with a set of CSV files that, in their raw state, contain inconsistencies, duplicates, empty fields, and incompatible formats that will make it impossible to correctly calculate the mandatory FIFO method in Spain.
At Cleriontax we apply data engineering methodologies to every tax report we prepare, because we know that a dirty dataset produces wrong results regardless of how good the calculation tool is. This article will teach you how to prepare your data the way we do, applying professional cleaning and normalization techniques that ensure the accuracy of your return before the AEAT.
Why data cleaning is critical for your tax return
When you export your transaction history from Binance, Coinbase, Kraken or any exchange, you get CSV files that were designed for internal use by each platform, not for tax compliance. Each exchange uses its own format, its own names for transaction types, and its own criteria for recording trades. If you also have activity in wallets like Metamask and have exported data from blockchain explorers, the heterogeneity of formats multiplies exponentially.
The fundamental problem is that the AEAT requires precise, chronologically ordered data expressed in euros to correctly apply the FIFO method. A single duplicate in your data can completely distort the acquisition cost of your cryptocurrencies, generating incorrect capital gains or losses.
Most frequent errors in raw datasets
| Error type | Tax consequence | Frequency |
|---|---|---|
| Duplicate transactions | Double counting, incorrect FIFO | Very high |
| Dates in incompatible formats | Wrong tax year, incorrect ordering | High |
| Misclassified internal operations | Fictitious capital gains | High |
| Values not converted to EUR | Calculations in mixed currencies | Medium |
| Empty or null fields | Errors in formulas and calculations | Medium |
| Different names for the same concept | Incorrect tax classification | High |
Each of these issues, if not corrected before calculating capital gains, can result in an incorrect tax return with the corresponding penalties from the Tax Authority.
Phase 1: Inventory and initial audit of data sources
Before modifying any file, it is essential to create a complete inventory of all the data sources you need to consolidate. This initial audit phase determines the scope of the cleaning work and allows you to identify potential information gaps.
Start by creating an exhaustive list of all exchanges where you have traded during the tax year, including platforms you no longer use but where you carried out operations in the period to be reported. For each exchange, identify which types of export you have generated by following the specific guides for each platform. If you have not completed the export from any exchange, we recommend checking our guide to correctly export your Binance history as a methodological reference.
Source inventory checklist
Centralized exchanges:
- Identify all platforms used (active and inactive)
- Verify that all types of history have been exported (trades, staking, earn, conversions)
- Document the date range of each export
- Note the format of each file (CSV, XLSX, JSON)
Non-custodial wallets:
- List all wallet addresses used
- Identify the blockchains where each wallet has activity
- Verify exports from Etherscan, Polygonscan, etc.
- Include both transactions and token transfers
Recommended folder structure:
📁 Tax_Data_2024/
├── 📁 01_Original_Intact/
│ ├── 📁 Binance/
│ ├── 📁 Coinbase/
│ ├── 📁 Metamask_ETH/
│ └── 📁 Metamask_Polygon/
├── 📁 02_In_Progress/
└── 📁 03_Final_Consolidated/
Always keep an untouched copy of the exported files before making any changes. This is essential both for internal audits and for potential AEAT inquiries where you may need to demonstrate the traceability of your calculations.
Phase 2: Normalizing date and time formats
Date formats are one of the most frequent and potentially serious problems in cryptocurrency datasets. Each exchange uses its own format, and the differences may seem subtle but have significant consequences for tax calculations.
Date formats by platform
| Platform | Original format | Example | Time zone |
|---|---|---|---|
| Binance | YYYY-MM-DD HH:MM:SS | 2024-06-15 14:32:18 | UTC |
| Coinbase | ISO 8601 | 2024-06-15T14:32:18Z | UTC |
| Kraken | Unix timestamp | 1718458338 | UTC |
| Etherscan | Unix timestamp | 1718458338 | UTC |
| Bitstamp | DD/MM/YYYY HH:MM | 15/06/2024 14:32 | CET |
When these files are combined without normalization, chronological sorting can produce completely incorrect results because the software interprets the fields differently.
Date normalization protocol
The process must follow these steps in strict order:
- Identify the format of each source file and document it explicitly
- Convert all formats to ISO 8601 (YYYY-MM-DDTHH:MM:SS) because of its natural sortability
- Adjust time zone to Spain (CET/CEST depending on the date)
- Verify the conversion by comparing known dates with the original record
Critical time zone example:
A transaction on Etherscan:
2024-12-31 23:30:00 UTCConverted to Spain (winter time):
2025-01-01 00:30:00 CETTax year: 2025, not 2024
This difference determines in which tax year the transaction must be reported, which can have significant implications if it is a material capital gain or loss.
Phase 3: Detecting and removing duplicates
Duplicate transactions are the most common and one of the most harmful errors in cryptocurrency datasets. They arise mainly from exports with overlapping date ranges and from operations that appear recorded in multiple ways in different export types from the same exchange.
To correctly detect duplicates, it is not enough to compare identical rows. You must implement an identification logic based on key fields that define the uniqueness of each transaction.
Key fields to identify duplicates
In centralized exchanges:
- Exact date and time (down to seconds)
- Trading pair (BTC/EUR, ETH/USDT, etc.)
- Transaction type (buy, sell, trade)
- Executed amount
- Execution price
In blockchain transactions:
- Transaction hash (txHash) → definitive unique identifier
- If there is no hash: combination of block + from + to + value
Three-pass deduplication process
| Pass | Criterion | Action |
|---|---|---|
| First | Rows 100% identical | Automatically delete |
| Second | Key fields match, secondary columns differ | Delete while keeping the row with more information |
| Third | Similar but not identical transactions | Manual case-by-case review |
Example of a subtle duplicate:
Row 1: 2024-06-15 14:32:18 | BTC/EUR | BUY | 0.05 | 62450.00 | fee: 0.0001
Row 2: 2024-06-15 14:32:18 | BTC/EUR | BUY | 0.05 | 62450.00 | fee: (empty)
Both represent the same transaction, but one has the fee and the other does not. Pass 2 should keep row 1.
Keep a record of all duplicates removed. This log serves as documentation of the cleaning process and allows you to verify that legitimate operations have not been deleted by mistake. If you work with analysis tools like those described in our guide to tools for tracking cryptocurrencies, many include duplicate detection features you can use for cross-checking.
Phase 4: Classifying and standardizing transaction types
Each exchange uses its own terminology to describe transaction types. What Binance calls "SPOT Trade" may appear as "Trade" on Coinbase, "Exchange" on Kraken, or simply "Buy" on other platforms. To correctly apply the tax treatment to each transaction, you need a standardized classification system.
Mapping table: original names → tax category
| Original name | Exchange | Standardized tax category |
|---|---|---|
| SPOT Trade, Trade, Exchange | Various | Swap / Capital gain |
| Buy, Purchase, Compra | Various | Acquisition |
| Sell, Venta | Various | Disposal / Capital gain |
| Convert, Conversion, Swap | Various | Swap / Capital gain |
| Deposit, Ingreso | Various | Internal movement (verify origin) |
| Withdrawal, Retiro | Various | Internal movement (verify destination) |
| Staking Reward, Interest | Various | Investment income |
| Airdrop, Distribution | Various | Income (value at market price) |
| Fee, Commission, Gas | Various | Deductible expense |
Main tax categories
Correct classification of each transaction determines its tax treatment:
- Capital gains/losses: Sales, swaps between cryptos, payments with crypto
- Investment income: Staking, lending, farming, airdrops
- Internal movements: Transfers between your own wallets (not taxable)
- Deductible expenses: Commissions, gas fees, trading fees
A frequent error is to classify transfers between your own wallets as sales, which generates fictitious capital gains. To go deeper into the classification criteria according to the AEAT, we recommend reading our article on how to classify cryptocurrency transactions.
Phase 5: Converting values to euros
The AEAT requires that all transactions be reported in euros, valued at the exchange rate at the exact moment of each transaction. This requirement implies a conversion process that must be carried out accurately and with documented pricing sources.
Recommended pricing sources
| Asset type | Primary source | Alternative source | Granularity |
|---|---|---|---|
| BTC, ETH, top 20 | CoinGecko API | CoinMarketCap | Minutes |
| Liquid altcoins | CoinGecko API | Exchange where traded | Hours |
| Illiquid tokens | Price on exchange | DEX (Uniswap, etc.) | Transaction |
| USD stablecoins | EUR/USD exchange rate | ECB or OANDA | Daily |
Conversion process for crypto-to-crypto transactions
For transactions carried out in crypto-crypto pairs, you need a two-step conversion:
Example: Swap ETH → LINK on Uniswap
1. Original transaction:
- Sent: 2 ETH
- Received: 150 LINK
- Date: 2024-07-20 15:45:00 UTC
2. Step 1 – Value ETH in EUR:
- ETH price on 2024-07-20 15:45: 3,200 €
- Value sent: 2 × 3,200 = 6,400 €
3. Step 2 – Value LINK in EUR:
- LINK price on 2024-07-20 15:45: 42.50 €
- Value received: 150 × 42.50 = 6,375 €
4. Tax result:
- Disposal value: 6,375 € (what you receive)
- If ETH acquisition cost was 5,000 €:
- Capital gain: 6,375 - 5,000 = 1,375 €
The process must document the source of each price used. This is especially important for high-value trades where a difference in pricing can mean thousands of euros difference in the calculated capital gain.
At Cleriontax we use our own validated historical price databases that allow us to apply accurate conversions even for illiquid tokens.
Phase 6: Handling empty fields and anomalous values
Empty fields and anomalous values in cryptocurrency datasets can have several legitimate causes, but they can also indicate export issues that require correction. Handling these cases must be done carefully so as not to delete valid information or propagate errors.
Interpreting empty fields
| Empty field | Likely interpretation | Action |
|---|---|---|
| Fee / Commission | Promotion without fee or fee in another token | Verify; if correct → 0 |
| Price | Serious export error | Check original history |
| Amount | Serious export error | Check original history |
| Transaction hash | Normal in centralized exchanges | Acceptable |
| Notes / Memo | Optional field not used | Ignore |
Detecting anomalous values
Implement automatic validations that flag suspicious values:
- Implicit price ±20% vs market: Flag for manual review
- Negative amounts: Sign error, correct
- Dates out of range: Corrupted export, re-export
- Fees higher than 10% of value: Verify (may be legitimate on congested networks)
Example of a legitimate anomaly:
Transaction: Swap 0.01 ETH → 50 USDC
Date: 2024-05-15 (network congestion due to memecoin)
Gas fee: 0.008 ETH (80% of the traded value)
→ It looks anomalous but is correct. Document and keep.
Each anomaly must be investigated individually to determine whether it represents an error that needs to be corrected or an atypical but legitimate transaction.
Phase 7: Consolidating multiple sources
Once each source file has been normalized individually, the next step is to consolidate all the data into a single dataset that represents all your cryptocurrency activity during the tax period.
Unified column schema
| Column | Type | Mandatory | Description |
|---|---|---|---|
| fecha_hora | datetime | Yes | ISO 8601, Spain time zone |
| tipo_operacion | string | Yes | Standardized tax category |
| cripto_enviada | string | Depending on type | Symbol of the asset sold/sent |
| cantidad_enviada | decimal | Depending on type | Amount of the asset sent |
| cripto_recibida | string | Depending on type | Symbol of the asset bought/received |
| cantidad_recibida | decimal | Depending on type | Amount of the asset received |
| valor_eur | decimal | Yes | Transaction value in EUR |
| fee_eur | decimal | Yes | Total fee in EUR |
| fuente | string | Yes | Source exchange/wallet |
| hash_tx | string | No | Blockchain hash if applicable |
| notas | string | No | Comments on the process |
Consolidation process
- Add an origin column to each record before combining
- Unify column names according to the standard schema
- Combine all files into a single dataset
- Sort chronologically by fecha_hora
- Run a final deduplication in case there are cross-recorded operations
If consolidation and FIFO calculation are complex for you, our portfolio analysis service includes the entire cleaning, normalization, and calculation process with professional verification at every step.
Phase 8: Cross-validation and integrity checks
Before using the clean dataset to calculate your tax return, it is essential to perform cross-validation to verify the integrity of the processed data.
Three mandatory checks
1. Control balance
Calculate the final balance of each cryptocurrency according to your dataset and compare it with the real balance:
Balance according to dataset BTC: 0.5423 BTC
Real balance in exchanges + wallets: 0.5420 BTC
Difference: 0.0003 BTC (0.05%)
→ Acceptable difference due to rounding
Differences greater than 1% require investigation.
2. Transaction count
Total original records: 2,847
Documented duplicates removed: 156
Records in final dataset: 2,691
2,847 - 156 = 2,691 ✓ Checks out
3. Temporal consistency
Check that there are no unexplained temporal gaps. If you traded actively all year but your dataset shows months with no activity, this may indicate incomplete exports.
This validation phase is where a professional tax filing service adds the most value, as experience allows us to identify error patterns that an occasional user might miss.
Tools to automate cleaning
For users with high transaction volumes, manual processing can be impractical. These tools help automate part of the process:
Tool comparison
| Tool | Recommended volume | Technical level | Cost |
|---|---|---|---|
| Excel / Google Sheets | < 2,000 tx | Basic | Free |
| Google Sheets + QUERY | < 5,000 tx | Intermediate | Free |
| Python + pandas | Unlimited | Advanced | Free |
| Koinly / CoinTracking | < 10,000 tx | Basic | Paid |
For Excel/Sheets:
- REMOVE.DUPLICATES function for pass 1
- Pivot tables for total verification
- VLOOKUP to add prices
- Conditional formatting to detect anomalies
For technical users (Python):
import pandas as pd
# Load and combine sources
df = pd.concat([binance_df, coinbase_df, etherscan_df])
# Normalize dates
df['fecha'] = pd.to_datetime(df['fecha'], utc=True)
df['fecha'] = df['fecha'].dt.tz_convert('Europe/Madrid')
# Remove duplicates by key fields
df = df.drop_duplicates(subset=['fecha', 'par', 'tipo', 'cantidad'])
# Sort chronologically
df = df.sort_values('fecha')
Crypto tax tools such as Koinly include built-in cleaning features, but they apply their own criteria, which may not match the AEAT requirements exactly. We recommend using them as a complement for cross-checking.
Documenting the cleaning process
Every decision taken during the cleaning process must be documented. This documentation serves to reproduce the process if you need to recalculate it, provides evidence for potential AEAT inquiries, and facilitates review by third parties.
Contents of the methodology document
- Full inventory of sources with export dates
- Normalization criteria applied to each type of field
- Mapping table of transaction types
- Pricing sources used for EUR conversion
- Record of duplicates removed with justification
- Anomalies detected and how they were handled
- Results of integrity checks
Also keep all intermediate files from the process, not just the originals and the final result. If you later detect an error, the intermediate files allow you to identify in which phase it was introduced.
In our portfolio monitoring services, we include full documentation of the analysis process delivered together with the tax report, ensuring complete traceability.
Critical errors you must avoid
There are errors in the cleaning process that have especially serious consequences:
1. Deleting legitimate transactions as "duplicates"
In active trading it is common to execute multiple transactions in the same pair within short time intervals. Check that the transactions are truly identical before deleting them.
2. Incorrect currency conversions
Dividing when you should multiply, or using the inverse exchange rate, generates absurd values that may go unnoticed without balance checks.
3. Ignoring fees
Fees are deductible expenses that reduce your capital gain. A dataset where fees have been lost produces a return that pays more tax than is due.
4. Classifying internal movements as sales
Movements between your own exchanges and wallets are not taxable. If they are classified as sales, they generate fictitious capital gains that you should not be paying.
If you are not confident about any part of the process, our tax advisory team specialized in cryptocurrencies can review your work and detect issues before they impact your tax return.
Conclusion: clean data, correct tax return
Cleaning cryptocurrency datasets is not an optional step or a cosmetic improvement. It is a fundamental requirement for obtaining an accurate tax return that complies with AEAT requirements and that you can defend in the event of an inquiry. A dirty dataset will produce incorrect results regardless of the quality of the software or calculation methodology you use afterwards.
The entire process requires time, attention to detail, and knowledge of both cryptocurrency trading and Spanish tax requirements. For users with simple trading and few transactions, following this guide should be enough. For complex activity with multiple exchanges, significant DeFi operations, or thousands of transactions, professional assistance can save time and prevent costly errors.
At Cleriontax we combine experience in cryptocurrency taxation with rigorous data engineering methodologies to ensure that every report we prepare is based on verified and traceable data. If you prefer to delegate this technical process, our specialized services are designed exactly for that.
Your next step: If you have already exported your data and need help with cleaning, consolidation, or tax calculations, contact our team for a no-obligation initial assessment. If you are still in the export phase, visit our blog where you will find specific guides for each exchange and wallet.
Disclaimer: This article is for informational and educational purposes only. It does not constitute personalized tax advice. Tax regulations are subject to change and each personal situation is unique. Always consult a professional tax advisor before making tax decisions.
Last updated: December 2025
Published by: Cleriontax Team – Experts in Crypto Taxation and Data Analysis
Did you find this article helpful?
Share it with other investors who might need it


