GitHub - mhlevgen/xgboost_duplicates: Find duplicates in data with xgboost

XGBoost for finding duplicates in announcements

Features:

Numeric features:
- geo position (lat, lng)
- property features (totalarea, roomscount, floornumber, floorsCount)
- price features (price, mortgageAllowed)
Categorical features:
- category (announcement type)
- currency
- materialType (building material type)
Text features:
- description

Since the task is binary classification (a pair of announcements is either a duplicate or not), the features for each pair were calculated as abs(features_announcement_1 - features_announcement_2). Categorical features were one-hot encoded (OHE). Descriptions were stemmed and lemmatized, and stop words were removed. The resulting description feature is the proportion of words that the two announcements in a pair have in common.
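The pair construction described above can be sketched roughly as follows. This is an illustration under assumptions, not the repository's code: the helper names (`numeric_pair_features`, `tokenize`, `common_word_share`) are hypothetical, the "proportion of common words" is assumed here to be the Jaccard ratio (shared words over all distinct words), and stemming, lemmatization, stop-word removal and one-hot encoding are left out for brevity.

```python
import re

# Numeric fields named in the feature list above.
NUMERIC_FIELDS = ["lat", "lng", "totalarea", "roomscount", "floornumber",
                  "floorsCount", "price", "mortgageAllowed"]


def numeric_pair_features(a, b):
    """Absolute difference of every numeric field for a pair of announcements."""
    return {f"{name}_diff": abs(float(a[name]) - float(b[name]))
            for name in NUMERIC_FIELDS}


def tokenize(text):
    """Lowercase the text and split it into a set of words.

    Assumption: the real pipeline also stems, lemmatizes and drops stop words.
    """
    return set(re.findall(r"\w+", text.lower()))


def common_word_share(desc_a, desc_b):
    """Share of words the two descriptions have in common.

    Assumption: the denominator is the union of both word sets (Jaccard);
    the original text does not say which denominator was used.
    """
    words_a, words_b = tokenize(desc_a), tokenize(desc_b)
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)
```

For example, `common_word_share("cozy flat near metro", "flat near park")` shares two words out of five distinct ones. The resulting numeric differences and the text share would then be concatenated into one feature row per pair.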

Feature importance

Precision Recall curve

The green line corresponds to the threshold range (0.47 - 0.77) where precision > 0.8 and recall > 0.7.
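One way such a threshold range could be found is to scan the candidate thresholds and keep those that satisfy both constraints. The sketch below is plain Python with hypothetical function names; it assumes `y_true` holds 0/1 labels and `scores` holds the model's predicted duplicate probabilities, and it is not taken from the repository.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is called a duplicate."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and not p)
    # Convention chosen here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def acceptable_thresholds(y_true, scores, min_precision=0.8, min_recall=0.7):
    """Lowest and highest observed score that meets both constraints, or None.

    Every distinct score is tried as a threshold. The qualifying thresholds
    need not form a contiguous band; only the two extremes are returned.
    """
    ok = []
    for t in sorted(set(scores)):
        precision, recall = precision_recall_at(y_true, scores, t)
        if precision > min_precision and recall > min_recall:
            ok.append(t)
    return (min(ok), max(ok)) if ok else None
```

The default bounds mirror the precision > 0.8 and recall > 0.7 constraints stated above; on held-out predictions the function returns the two ends of the qualifying range.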

Flask launch

docker build -t cian_task .
docker run -it --name=cian_task -p 5050:5050 cian_task bash
python app.py

The service then accepts predictions at http://localhost:5050/predict (POST with JSON input).
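A client call against this endpoint might look like the sketch below, which uses only the Python standard library. The payload layout is an assumption: the JSON schema that app.py expects is not documented here, so the caller must supply it, and `build_request` and `predict` are hypothetical helper names.

```python
import json
import urllib.request

PREDICT_URL = "http://localhost:5050/predict"


def build_request(payload, url=PREDICT_URL):
    """Wrap a JSON-serializable payload in a POST request.

    Assumption: the payload structure must match whatever app.py parses;
    that schema is not specified in this README.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(payload, url=PREDICT_URL, timeout=10):
    """Send the payload to the running service and return the decoded JSON reply."""
    with urllib.request.urlopen(build_request(payload, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the container running, `predict(payload)` would return whatever JSON the service responds with; the response format is likewise not documented here.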
