Data Cleaning Publications Grouped by Conferences

ICDE

D. V. Kalashnikov, Z. Chen, R. Nuray-Turan, S. Mehrotra and N. Ashish. Disambiguation algorithm for people search on the web. In the Proceedings of IEEE ICDE 2007 Conference. April, 2007.
B. On, N. Koudas, D. Lee, and D. Srivastava. Group Linkage. In ICDE 2007. April, 2007. [link]
I. Mansuri and S. Sarawagi. A system for integrating unstructured data into relational databases. In ICDE, 2006. [link]
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006. [link]
S. Chaudhuri, V. Ganti, R. Motwani. Robust identification of fuzzy duplicates. In ICDE, 2005. [link]
G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE. 2002. [link]

VLDB

D. Menestrina, O. Benjelloun, H. Garcia-Molina. Generic Entity Resolution with Data Confidences. In First Int'l VLDB Workshop on Clean Databases, 2006. [link]
A. Arasu, V. Ganti, R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006. [link]
L. Jin and C. Li. Selectivity Estimation for Fuzzy String Predicates in Large Datasets. In VLDB, 2005. [link]
M. Michalowski, S. Thakkar, C. A. Knoblock. Exploiting Secondary Sources for Unsupervised Record Linkage. In VLDB, 2004. [link]
V. Verykios, G. V. Moustakides, and M. Elfeky. A Bayesian decision model for cost optimal record matching. VLDB Journal, 2003. [link]
R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB Conference. 2002. [link]
L. Gravano, P. Ipeirotis, H. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001. [link]
Y. Zhuang and L. Chen. In-network Outlier Cleaning for Data Collection in Sensor Networks. In CleanDB Workshop. [link]

SIGMOD

S. Chaudhuri, K. Ganjam, V. Ganti, R. Kapoor, V. Narasayya, and T. Vassilakis. Data cleaning in Microsoft SQL Server. In SIGMOD, 2005. [link]
X. Dong, A. Y. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In SIGMOD, 2005. [link]
S. Chaudhuri, K. Ganjam, V. Ganti, R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, 2003. [link]
A. E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In SIGMOD, 1997. [link]
M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In SIGMOD, 1995. [link]
W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In SIGMOD, 1998. [link]

SDM

B. On, D. Lee. Scalable Name Disambiguation using Multi-level Graph Partition. In SIAM SDM, April 2007. [link]
I. Bhattacharya and L. Getoor. A latent Dirichlet model for unsupervised entity resolution. In SIAM SDM. 2006. [link]
D. V. Kalashnikov, S. Mehrotra, and Z. Chen. Exploiting relationships for domain independent data cleaning. In SDM 2005. 2005. [link]
B. Malin. Unsupervised name disambiguation via social network similarity. In Workshop on Link Analysis, Counterterrorism, and Security, 2005. [link]

KDD

E. Agichtein and V. Ganti. Mining reference tables for automatic text segmentation. In SIGKDD, 2004. [link]
I. Bhattacharya and L. Getoor. Deduplication and group detection using links. In LinkKDD-04. 2004. [link]
M. Bilenko and R. J. Mooney. Adaptive Duplicate Detection Using Learnable String Similarity Measures. In SIGKDD. 2003. [link]
M. Bilenko and R. J. Mooney. On Evaluation and Training-Set Construction for Duplicate Detection. In KDD 2003 Workshop. 2003.
W. W. Cohen and J. Richman. Learning to match and cluster high-dimensional data sets for data integration. In SIGKDD, 2002. [link]
S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In SIGKDD, 2002. [link]
S. Tejada, C. A. Knoblock, and S. Minton. Learning domain independent string transformation weights for high accuracy object identification. In SIGKDD, 2002. [link]
A. E. Monge and C. Elkan. The field matching problem: Algorithms and applications. In SIGKDD, 1996. [link]
W. Cohen, H. Kautz, and D. McAllester. Hardening soft information sources. In SIGKDD, 2000. [link]
I. Bhattacharya and L. Getoor. Query-time entity resolution. In SIGKDD. 2006. [link]
R. Holzer, B. Malin and L. Sweeney. Email alias detection using social network analysis. In SIGKDD Workshop, 2005. [link]
A. McCallum, K. Nigam, and L. H. Ungar. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In ACM KDD, Boston, MA, 2000. [link]

DASFAA

R. Nuray-Turan, D. V. Kalashnikov and S. Mehrotra. Self-tuning in graph-based reference disambiguation. In the Proceedings of DASFAA 2007. April, 2007. [link]
L. Jin, C. Li, and S. Mehrotra. Efficient Record Linkage in Large Data Sets. In DASFAA 2003, 2003. [link]

ICDM

1. B. On, E. Elmacioglu, D. Lee, J. Kang, and J. Pei. Improving grouped-entity resolution using quasi-cliques. In ICDM 2006. December, 2006. [link]

2. P. Singla and P. Domingos. Entity resolution with Markov logic. In ICDM 2006. December, 2006. [link]

3. M. Bilenko, B. Kamath, and R. J. Mooney. Adaptive Blocking: Learning to Scale Up Record Linkage and Clustering. In ICDM. 2006. [link]

4. M. Bilenko, S. Basu, and M. Sahami. Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping. In ICDM. 2005. [link]

PKDD

L. Bolelli, S. Ertekin, C. L. Giles. Clustering Scientific Literature Using Sparse Citation Graph Analysis. 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2006): 30-41, 2006. [link]
J. Huang, S. Ertekin, C. L. Giles. Efficient Name Disambiguation for Large-Scale Databases. 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2006): 536-544, 2006. [link]

JCDL

B. On, E. Elmacioglu, D. Lee, J. Kang, and J. Pei. An effective approach to entity resolution problem using quasi-clique and its application to digital libraries. In JCDL. June, 2006. [link]
Y. F. Tan, M-Y. Kan and D. Lee. Search Engine Driven Author Disambiguation. In JCDL. June, 2006. [link]
H. Han, H. Zha, C. L. Giles. Name disambiguation in author citations using a K-way spectral clustering method. Joint Conference on Digital Libraries 2005 (JCDL 2005): 334-343, 2005. [link]

IQIS

Z. Chen, D. V. Kalashnikov and S. Mehrotra. Exploiting relationships for object consolidation. In IQIS. 2005. [link]
A. Al-Lawati, D. Lee, and P. McDaniel. Blocking-aware private record linkage. In IQIS. 2005. [link]
D. Lee, B. On, J. Kang and S. Park. Effective and scalable solutions for mixed and split citation problems in digital libraries. In IQIS. 2005. [link]

IJCAI

1. P. Kanani, A. McCallum, and C. Pal. Improving author coreference by resource-bounded information gathering from the web. In IJCAI. 2007. [link]

2. S. Hill. Social network relational vectors for anonymous identity matching. In IJCAI, 2005. [link]

3. B. Milch, B. Marthi, D. Sontag, S. Russell, D. L. Ong, and A. Kolobov. BLOG: Probabilistic models with unknown objects. In IJCAI, 2005. [link]

AAAI

X. Li, P. Morie, and D. Roth. Identification and tracing of ambiguous names: discriminative and generative approaches. In AAAI, 2004. [link]
W. Shen, X. Li and A. Doan. Constraint-based entity matching. In AAAI 2005. 2005.

NIPS

1. A. McCallum and B. Wellner. Conditional models of identity uncertainty with application to noun coreference. In NIPS. 2004. [link]

2. H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In NIPS, 2002. [link]

SIGIR

1. E. Minkov, W. W. Cohen, and A. Y. Ng. Contextual Search and Name Disambiguation in Email using Graphs. In SIGIR-2006. [link]

2. J. Artiles, J. Gonzalo, and S. Sekine. A testbed for people searching strategies in the WWW. In SIGIR. 2005. [link]

Journals and Other Conferences/Workshops

D. V. Kalashnikov and S. Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. In ACM TODS. June, 2006. [link]
G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 2001. [link]
R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In WWW. 2005. [link]
A. Culotta and A. McCallum. Joint deduplication of multiple record types in relational data. In CIKM. 2005. [link]
P. Ravikumar and W. W. Cohen. A hierarchical graphical model for record linkage. In UAI, 2004. [link]
A. McCallum, K. Bellare and F. Pereira. A conditional random field for discriminatively-trained finite-state string edit distance. In UAI. 2005. [link]
E. Ristad and P. Yianilos. Learning string edit distance. IEEE Trans. Pattern Analysis and Machine Intelligence, 1998. [link]
I. Bhattacharya and L. Getoor. Relational clustering for multi-type entity resolution. In MRDM. 2005. [link]
P. Singla and P. Domingos. Multi-relational record linkage. In MRDM, 2004. [link]
V. Sehgal, L. Getoor, and P. Viechnicki. Entity resolution in geospatial data integration. In GIS, 2006. [link]
I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. In DMKD’04. 2004. [link]
O. Benjelloun, H. Garcia-Molina, H. Kawai, T. E. Larson, D. Menestrina, Q. Su, S. Thavisomboon, J. Widom. Generic Entity Resolution in the SERF Project. IEEE Data Engineering Bulletin, June 2006. [link]
L. Gravano, P. Ipeirotis, H. Jagadish, N. Koudas, S. Muthukrishnan, L. Pietarinen, and D. Srivastava. Using q-grams in a DBMS for approximate string processing. IEEE Data Engineering Bulletin, 24(4):28–34, 2001. [link]
M. Lee, W. Hsu, and V. Kothari. Cleaning the spurious links in data. IEEE Intelligent Systems. 2004. [link]
R. Al-Kamha and D.W. Embley. Grouping Search-Engine Returned Citations for Person Name Queries. In WIDM’04, 2004. [link]
W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. IIWeb Workshop, 2003. [link]
P. Christen, T. Churches, and J. X. Zhu. Probabilistic name and address cleaning and standardization. The Australasian Data Mining Workshop, 2002. [link]
W. E. Winkler. Methods for record linkage and Bayesian networks. Technical Report, US Census Bureau, 2002. [link]
E. Cohen and D. Lewis. Approximating matrix multiplication for pattern recognition tasks. J. Algorithms. 30(2): 211-252. [link]
I. Fellegi and A. Sunter. A theory for record linkage. Journal of Amer. Statistical Association. 1969. [link]
J. Maletic and A. Marcus. Data cleansing: Beyond integrity checking. In Conf. on Information Quality, 2000. [link]
M. Jaro. Probabilistic linkage of large public health data files. Statistics in medicine, 1995. [link]
M. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of Amer. Statistical Association, 1989.
M. Lee, H. Lu, T. Ling, and Y. Ko. Cleansing data for mining and warehouse. In DEXA, 1999. [link]
S. Tejada, C. A. Knoblock, and S. Minton. Learning object identification rules for information integration. Information Systems Journal, 2001. [link]
W. E. Winkler. The state of record linkage and current research problems. Technical Report, US Census Bureau, 1999. [link]
M. Bilgic, L. Licamele, L. Getoor, and B. Shneiderman. D-dupe: An interactive tool for entity resolution in social networks. In IEEE VAST, 2006. [link]
A. Culotta and A. McCallum. Tractable learning and inference with higher-order representations. In ICML Workshop on Open Problems in Statistical Relational Learning. 2006. [link]
E. Minkov and W. W. Cohen. An Email and Meeting Assistant using Graph Walks. In CEAS-2006. [link]
J. Hassell, B. Aleman-Meza, and I. B. Arpinar. Ontology-driven automatic entity disambiguation in unstructured text. In 5th International Semantic Web Conference (ISWC2006), 2006. [link]
J. Kang, D. Lee and P. Mitra. Identifying value mappings for data integration: an unsupervised approach. In WISE. 2005. [link]
X. Li, P. Morie, and D. Roth. Semantic integration in text: From ambiguous names to identifiable entities. AI Magazine. Special issue on semantic integration. 2005. [link]
H. Newcombe, J. Kennedy, S. Axford, and A. James. Automatic linkage of vital records. Science, 1959.